Title: An Assessment of the UK Economy through Payment Transactions
1.0 Dataset Source Description of the sources for the datasets used
2.0 Data Preparation Source file upload into Pandas dataframes, cleaning, labelling, data type conversion, date formatting, imputing values, calculation of absolute values, merging of datasets
3.1 Data Derivation Part I - Payment Transactions Consolidation of multiple dataframes into four main dataframes for High Value, Low Value, Cheque and Credit Card Payments
3.2 Data Derivation Part II - GDP Consolidation of multiple sources into one and conversion of different date formats into a consistent quarterly format
4.1 Exploratory Data Analysis - Payments Data Timeseries, Scatterplot and Boxplots used to check Data Quality and Characteristics
4.2 Exploratory Data Analysis - GDP Data Timeseries, Scatterplot and Boxplots used to check Data Quality and Characteristics
5.0 Construction of Model Feature selection, Feature Engineering, Correlation and Variance Inflation Analysis, model options review, model selection
5.1 Simple Linear Regression Model and Evaluation OLS Linear Regression, Model Summary, Harvey-Collier Linearity Test, Residual Histogram plot
5.2 Generalised Least Square Method An alternative approach to evaluate the performance of OLS Model. GLS is expected to address the OLS limitations for autocorrelations with lags in the predictor variable
5.3 Multiple Linear Regression An alternative approach to evaluate the performance of OLS Model. MLR is applied to check if it is a better approach than OLS Simple Linear Regression or if multicollinearity impacts the robustness and performance of the model
6.0 Model Validation Additional methods like Sklearn Linear Regression with Cross Fold Validation and Polynomial fit modelling used for assessing OLS Simple Linear Regression Model robustness and performance. Review for overfitting/coefficient variance for different samples
6.1 SK Learn Linear Regression and Cross Fold Validation Comparison of model outputs from a different Linear Regression Method and the impact of sample size on the robustness of the model by using K Fold Cross Validation Technique
6.2 Polynomial Fit Regression Model Polynomial Fit Regression model used with various orders to assess any quadratic, cubic or higher order relationship between the predictor variables and target variable
The analysis is conducted using UK payment systems and GDP data available in the public domain via the links below. The dataset is split into three types - Bank Account Transactions, Credit Card Transactions and GDP
A. Bank Account Payment Transactions Timeseries for UK is available under https://www.wearepay.uk/factsandfigures/ in one file (filename : Historical-Monthly-Payment-Statistics-1990-to-Sep-2020) split into multiple tabs as per below
BACS : Monthly Timeseries available from Jan 1990 - Sep 2020 includes three different transaction types for volume (Standing Orders, Direct Debit, Direct Credit) and two features for value (All Credits = Standing Orders plus Direct Credits, and Direct Debits)
CHAPS : Monthly Timeseries available from Jan 1990 - Sep 2020 includes two main features for value and volume of GBP Payments. A few records are also available for payments made in Euro currency from Jan 1998 to May 2008. These records were ignored for the analysis as the dataset is not consistently available, nor is it deemed to have a significant impact on the analysis
MonthlyCheque : Monthly Timeseries for Physical Cheques is available from Jan 1990 till Aug 2019, as cheque processing was migrated to electronic imaging from 2019 onwards. Data is also available split between London and Scotland
Credit : Giro payments data available from Jan 1990 till Aug 2019 as processing was migrated to electronic imaging from 2019 onwards
Faster Payments : Faster Payments was introduced in May 2008 and the Monthly Timeseries is available since then; it contains four different transaction volume and value related fields. Three of the fields were combined to get consolidated volume and value. The Return payments value and volume were ignored for this analysis as the impact was less than 0.5%
Image Clearing : Live since 2019 and hence the Monthly Timeseries is available from that date. Historic Physical Cheque and Giro transactions are now expected to be settled through this clearing
B. Credit Card Payments : Data available under https://www.ukfinance.org.uk/data-and-research/data/cards/card-spending
C. GDP - two datasets have been used as the source of UK GDP data. Both are Quarterly Timeseries
https://docs.google.com/spreadsheets/d/1_rJJuTPWB-J8jkRZoeRi3De2tMLBxVcPKN2D-89lkUk/edit#gid=10 (via https://www.theguardian.com/news/datablog/2009/nov/25/gdp-uk-1948-growth-economy#data), sourced from ONS, with GDP presented as both Nominal and Inflation adjusted. Available from 1955 to 2013
https://fred.stlouisfed.org/series/RGDPMPUKQ with inflation adjusted GDP available from 1955 to 2016
The steps below are executed to prepare the datasets for the Exploratory Data Analysis. They ensure that the datasets are created from the source files after ignoring the header and footer rows containing commentary/narrative; that the time series have a consistent date format, as the source files do not contain the year/month combination on every row for an automatic upload; and that column headers are assigned manually, as the source files spread column headers across more than one row. The steps include
# Creating Pandas dataframe from Excel Source Files
import os
import pandas as pd
## Bank Account Payments - Excel file with multiple tabs uploaded to create individual dataframes for the different types of payments, namely the CHAPS, BACS, Cheques, Giro Payments, Faster Payments and Electronic Imaging Payments tabs
df_CHAPS = pd.read_excel('Historical-Monthly-Payment-Statistics-1990-to-Sep-2020_modified.xls', 'CHAPS', header=None, skiprows=8, skipfooter=10)
df_BACS = pd.read_excel('Historical-Monthly-Payment-Statistics-1990-to-Sep-2020_modified.xls', 'BACS', header=None, skiprows=8, skipfooter=8)
df_ChQ = pd.read_excel('Historical-Monthly-Payment-Statistics-1990-to-Sep-2020_modified.xls', 'MonthlyCheque',header=None, skiprows=8, skipfooter=10)
df_Clr = pd.read_excel('Historical-Monthly-Payment-Statistics-1990-to-Sep-2020_modified.xls', 'Credit',header=None, skiprows=8, skipfooter=8)
df_DCHQ = pd.read_excel('Historical-Monthly-Payment-Statistics-1990-to-Sep-2020_modified.xls', 'Image Clearing', header=None, skiprows=8, skipfooter=6)
df_FPS = pd.read_excel('Historical-Monthly-Payment-Statistics-1990-to-Sep-2020_modified.xls', 'Faster Payments', header=None, skiprows=8, skipfooter=4)
## Credit Card Payments - creating dataframe for the credit card payments
df_CreditCard = pd.read_excel('Total Market Credit Card Statistics - September 2020.xlsx', 'credit card data',header=None, skiprows=4, skipfooter=5)
## upload of UK GDP file from the US Federal Bank website
df_GDP_Fed = pd.read_csv('RGDPMPUKQ.csv',parse_dates=['DATE'],dayfirst=True)
## GDP file from Office of National Statistics
df_GDP_ONS = pd.read_csv('UK GDP, since 1955 GDP QUARTERLY mod.csv', parse_dates=['GrossDomesticProductinYear,quarter'])
print ("Data types for BACS", "\n", df_BACS.dtypes)
print ("Data types for CHAPS", "\n", df_CHAPS.dtypes)
print ("Data types for Faster Payments", "\n", df_FPS.dtypes)
print ("Data types for Physical Cheques", "\n", df_ChQ.dtypes)
print ("Data types for Image Clearing", "\n", df_DCHQ.dtypes)
print ("Data types for Giro Credits", "\n", df_Clr.dtypes)
print ("Data types for ONS GDP", "\n", df_GDP_ONS.dtypes)
print ("Data types for Fed GDP", "\n", df_GDP_Fed.dtypes)
(output: dtype listings for each raw dataframe - before headers are assigned the payment dataframes carry numeric column indices, mostly float64 with the Month column and the sparsely populated Euro CHAPS columns as object; the two GDP dataframes already have named columns with datetime64[ns] date fields)
## assigning datetime to the dataframes
# Adding column headers and cleaning up the uploaded files for Year and Month
import calendar
from datetime import datetime
## file header formatting and datetime conversion for CHAPS
df_CHAPS = df_CHAPS.rename(columns={0:"TimePeriod",1: "Year",2:"Month",3:"VolumeGBPCHAPS",4:"NA",5:"VolumeEURCHAPS",6:"VolumeEURCHAPSXB",7:"NA",8:"ValueGBPCHAPS",9:"NA",10:"ValueEURCHAPS",11:"ValueEURCHAPSXB"})
df_CHAPS['Year']=pd.to_datetime(df_CHAPS['Year'], format='%Y')
df_CHAPS['Year']=df_CHAPS['Year'].dt.year
df_CHAPS['Year']=df_CHAPS['Year'].fillna(method='ffill')
df_CHAPS['Day']="01"
df_CHAPS['Month']=pd.to_datetime(df_CHAPS['Month'],format='%B')
df_CHAPS['Month']=df_CHAPS['Month'].dt.month
df_CHAPS['TimePeriod'] = pd.to_datetime(dict(year=df_CHAPS['Year'],month=df_CHAPS['Month'], day=df_CHAPS['Day']),format='%Y%m%d')
## file header formatting and datetime conversion for BACS
df_BACS = df_BACS.rename(columns={0:"TimePeriod",1: "Year",2:"Month",3:"VolumeBACSSO",4:"VolumeBACSDC",5:"VolumeBACSDD",6:"NA",7:"ValueBACSSODC",8:"ValueBACSDD"})
df_BACS['Year']=pd.to_datetime(df_BACS['Year'], format='%Y')
df_BACS['Year']=df_BACS['Year'].dt.year
df_BACS['Year']=df_BACS['Year'].fillna(method='ffill')
df_BACS['Day']="01"
df_BACS['Month']=pd.to_datetime(df_BACS['Month'],format='%B')
df_BACS['Month']=df_BACS['Month'].dt.month
#df_BACS['TimePeriod'] = df_BACS.index
df_BACS['TimePeriod'] = pd.to_datetime(dict(year=df_BACS['Year'],month=df_BACS['Month'], day=df_BACS['Day']),format='%Y%m%d')
## file header formatting and datetime conversion for FPS
df_FPS = df_FPS.rename(columns={0:"TimePeriod",1: "Year",2:"Month",3:"NA", 4:"VolumeFPSSOP",5:"VolumeFPSSIP",6:"VolumeFPSFDP",7:"VolumeFPSReturns",8:"NA", 9:"ValueFPSSOP",10:"ValueFPSSIP",11:"ValueFPSFDP",12:"ValueFPSReturns"})
df_FPS['Year']=pd.to_datetime(df_FPS['Year'], format='%Y')
df_FPS['Year']=df_FPS['Year'].dt.year
df_FPS['Year']=df_FPS['Year'].fillna(method='ffill')
df_FPS['Day']="01"
df_FPS['Month']=pd.to_datetime(df_FPS['Month'],format='%B')
df_FPS['Month']=df_FPS['Month'].dt.month
df_FPS['TimePeriod'] = pd.to_datetime(dict(year=df_FPS['Year'],month=df_FPS['Month'], day=df_FPS['Day']),format='%Y%m%d')
## file header formatting and datetime conversion for Physical Cheque file
df_ChQ = df_ChQ.rename(columns={0:"TimePeriod",1: "Year",2:"Month",3:"VolChqLon", 4:"NA",5:"VolChqSct",6:"NA",7:"LdnInBr",8:"NA",9:"SctInBr",10:"",11:"ValChqLon",12:"NA", 13:"ValChqSct"})
df_ChQ['Year']=pd.to_datetime(df_ChQ['Year'], format='%Y')
df_ChQ['Year']=df_ChQ['Year'].dt.year
df_ChQ['Year']=df_ChQ['Year'].fillna(method='ffill')
df_ChQ['Day']="01"
df_ChQ['Month']=pd.to_datetime(df_ChQ['Month'],format='%B')
df_ChQ['Month']=df_ChQ['Month'].dt.month
df_ChQ['TimePeriod'] = pd.to_datetime(dict(year=df_ChQ['Year'],month=df_ChQ['Month'], day=df_ChQ['Day']),format='%Y%m%d')
## file header formatting and datetime conversion for Giro Payments
df_Clr = df_Clr.rename(columns={0:"TimePeriod",1: "Year",2:"Month",3:"VolCreLon", 4:"NA",5:"VolCreSco",6:"NA",7:"VolLdnInBr",8:"NA",9:"VolScoInBr",10:"NA",11:"ValCreLon",12:"NA", 13:"ValCreSct"})
df_Clr['Year']=pd.to_datetime(df_Clr['Year'], format='%Y')
df_Clr['Year']=df_Clr['Year'].dt.year
df_Clr['Year']=df_Clr['Year'].fillna(method='ffill')
df_Clr['Day']="01"
df_Clr['Month']=pd.to_datetime(df_Clr['Month'],format='%B')
df_Clr['Month']=df_Clr['Month'].dt.month
#df_Clr['TimePeriod'] = df_Clr.index
df_Clr['TimePeriod'] = pd.to_datetime(dict(year=df_Clr['Year'],month=df_Clr['Month'], day=df_Clr['Day']),format='%Y%m%d')
## file header formatting and datetime conversion for Digital Cheque file
df_DCHQ = df_DCHQ.rename(columns={0:"TimePeriod",1: "Year",2:"Month",3:"NA", 4:"ChqRTPVol",5:"ChqITPVol",6:"ChqRTPVal",7:"ChqITPVal"})
df_DCHQ['Year']=pd.to_datetime(df_DCHQ['Year'], format='%Y')
df_DCHQ['Year']=df_DCHQ['Year'].dt.year
df_DCHQ['Year']=df_DCHQ['Year'].fillna(method='ffill')
df_DCHQ['Day']="01"
df_DCHQ['Month']=pd.to_datetime(df_DCHQ['Month'],format='%B')
df_DCHQ['Month']=df_DCHQ['Month'].dt.month
#df_DCHQ['TimePeriod'] = df_DCHQ.index
df_DCHQ['TimePeriod'] = pd.to_datetime(dict(year=df_DCHQ['Year'],month=df_DCHQ['Month'], day=df_DCHQ['Day']),format='%Y%m%d')
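The same year-fill, month-parse and date-assembly steps are repeated verbatim for each tab above. They could be factored into a single helper; the sketch below (with the hypothetical name `assign_time_period`, not part of the original notebook) also uses `Series.ffill()` in place of the deprecated `fillna(method='ffill')`:

```python
import pandas as pd

def assign_time_period(df):
    """Fill down the sparsely populated Year column, parse month names,
    and assemble a first-of-month TimePeriod timestamp."""
    df = df.copy()
    df['Year'] = df['Year'].ffill().astype(int)                      # year only appears on the first row of each year
    df['Month'] = pd.to_datetime(df['Month'], format='%B').dt.month  # e.g. 'January' -> 1
    parts = pd.DataFrame({'year': df['Year'], 'month': df['Month'], 'day': 1})
    df['TimePeriod'] = pd.to_datetime(parts)                         # assemble date from components
    return df
```

Each repeated block would then reduce to a single call such as `df_BACS = assign_time_period(df_BACS)`.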
## file header formatting and assigning the datetime stamp to Credit Card transactions
df_CreditCard = df_CreditCard.rename(columns={0:"TimePeriod",1: "date",2:"Credit Balance",3:"NA", 4:"Net Changes MoM",5:"NA",6:"Annualised Growth net write-offs",7:"NA",8:"Gross Lending Monthly",9:"purchases Monthly/Gross Lending",10:"of which cash withdrawals/Gross Lending",11:"Balance Transfers",12:"NA", 13:"Volume of cards transactions",14:"of purchases /total txns", 15:"of cash withdrawal/total txns", 16:"bal tfr/total txns",17:"NA",18:"Credit Card avg value of txn",19:"Credit Card avg value of cash txn",20:"avg bal of balance tfr",21:"", 22:"no. of Credit cards in use", 23:"no. of accounts", 24:"no. of active accounts Credit Cards", 25:"% of active accounts for Credit Cards", 26:"NA", 27:"% balance bearing interest", 28:"NA", 29:"purchases seasonally adjusted"})
df_CreditCard['TimePeriod']=df_CreditCard['date']
## file header and date assignment for the two GDP dataframes
df_GDP_ONS = df_GDP_ONS.rename(columns={'GrossDomesticProductinYear,quarter':"GDP:Period",
'GrossDomesticProduct:QuarteronQuarterGrowth.': "GDP:QoQGrowth",
'Chainedvolumemeasures:Seasonallyadjusted.£m':"GDPONS:InfAdj£",
'GDP,notadjustedforinflation£m':"GDPONS:Value£",
'PerpersonNon-inflation':"GDPONS:PerPerNonAdj",
'Perperson,inflationadjusted':"GDPONS:PerPerInfAdj"})
#columns = ['Unnamed: 6', 'Unnamed: 7', 'Unnamed: 8','Unnamed: 9','Unnamed: 10']
#df_GDP_ONS.drop(columns, inplace=True, axis=1)
df_GDP_Fed = df_GDP_Fed.rename(columns={'RGDPMPUKQ':"GDPFed:UK"})
df_GDP_Fed['GDPFed:UK']=df_GDP_Fed['GDPFed:UK'].astype('float64')
df_CHAPS.head()
| TimePeriod | Year | Month | VolumeGBPCHAPS | NA | VolumeEURCHAPS | VolumeEURCHAPSXB | NA | ValueGBPCHAPS | NA | ValueEURCHAPS | ValueEURCHAPSXB | Day | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1990-01-01 | 1990.0 | 1 | 600.0 | NaN | - | - | NaN | 1419567.0 | NaN | - | - | 01 |
| 1 | 1990-02-01 | 1990.0 | 2 | 565.0 | NaN | - | - | NaN | 1314310.0 | NaN | - | - | 01 |
| 2 | 1990-03-01 | 1990.0 | 3 | 692.0 | NaN | - | - | NaN | 1656198.0 | NaN | - | - | 01 |
| 3 | 1990-04-01 | 1990.0 | 4 | 593.0 | NaN | - | - | NaN | 1371303.0 | NaN | - | - | 01 |
| 4 | 1990-05-01 | 1990.0 | 5 | 645.0 | NaN | - | - | NaN | 1550104.0 | NaN | - | - | 01 |
df_BACS.head()
| TimePeriod | Year | Month | VolumeBACSSO | VolumeBACSDC | VolumeBACSDD | NA | ValueBACSSODC | ValueBACSDD | Day | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1990-01-01 | 1990.0 | 1 | 22170.0 | 41374.0 | 65602.0 | NaN | 32943.00000 | 19800.000000 | 01 |
| 1 | 1990-02-01 | 1990.0 | 2 | 18065.0 | 39592.0 | 58870.0 | NaN | 29595.25284 | 16556.334235 | 01 |
| 2 | 1990-03-01 | 1990.0 | 3 | 20660.0 | 43504.0 | 61565.0 | NaN | 33163.00000 | 19355.000000 | 01 |
| 3 | 1990-04-01 | 1990.0 | 4 | 18813.0 | 39950.0 | 64461.0 | NaN | 31551.00000 | 19052.000000 | 01 |
| 4 | 1990-05-01 | 1990.0 | 5 | 21862.0 | 45694.0 | 72558.0 | NaN | 36413.00000 | 20723.000000 | 01 |
df_Clr.head()
| TimePeriod | Year | Month | VolCreLon | NA | VolCreSco | NA | VolLdnInBr | NA | VolScoInBr | NA | ValCreLon | NA | ValCreSct | Day | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1990-01-01 | 1990.0 | 1 | 14356.0 | NaN | 1232.0 | NaN | 24184.0 | NaN | 1401.0 | NaN | 9877.0 | NaN | 661.0 | 01 |
| 1 | 1990-02-01 | 1990.0 | 2 | 12302.0 | NaN | 1110.0 | NaN | 21346.0 | NaN | 1312.0 | NaN | 8363.0 | NaN | 574.0 | 01 |
| 2 | 1990-03-01 | 1990.0 | 3 | 14159.0 | NaN | 1247.0 | NaN | 23483.0 | NaN | 1410.0 | NaN | 9357.0 | NaN | 664.0 | 01 |
| 3 | 1990-04-01 | 1990.0 | 4 | 14255.0 | NaN | 1137.0 | NaN | 22522.0 | NaN | 1305.0 | NaN | 9500.0 | NaN | 661.0 | 01 |
| 4 | 1990-05-01 | 1990.0 | 5 | 15722.0 | NaN | 1286.0 | NaN | 25435.0 | NaN | 1457.0 | NaN | 9556.0 | NaN | 694.0 | 01 |
df_ChQ.head()
| TimePeriod | Year | Month | VolChqLon | NA | VolChqSct | NA | LdnInBr | NA | SctInBr | ValChqLon | NA | ValChqSct | Day | ||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1990-01-01 | 1990.0 | 1 | 194753.0 | NaN | 14062.0 | NaN | 52627.0 | NaN | 3436.0 | NaN | 95878.0 | NaN | 7935.0 | 01 |
| 1 | 1990-02-01 | 1990.0 | 2 | 172409.0 | NaN | 12119.0 | NaN | 46501.0 | NaN | 3113.0 | NaN | 81331.0 | NaN | 6794.0 | 01 |
| 2 | 1990-03-01 | 1990.0 | 3 | 190005.0 | NaN | 13305.0 | NaN | 51498.0 | NaN | 3472.0 | NaN | 91946.0 | NaN | 7579.0 | 01 |
| 3 | 1990-04-01 | 1990.0 | 4 | 182693.0 | NaN | 13082.0 | NaN | 49229.0 | NaN | 3396.0 | NaN | 89250.0 | NaN | 7706.0 | 01 |
| 4 | 1990-05-01 | 1990.0 | 5 | 205442.0 | NaN | 14466.0 | NaN | 55642.0 | NaN | 3712.0 | NaN | 95073.0 | NaN | 8230.0 | 01 |
df_CreditCard.head()
| TimePeriod | date | Credit Balance | NA | Net Changes MoM | NA | Annualised Growth net write-offs | NA | Gross Lending Monthly | purchases Monthly/Gross Lending | ... | no. of Credit cards in use | no. of accounts | no. of active accounts Credit Cards | % of active accounts for Credit Cards | NA | % balance bearing interest | NA | purchases seasonally adjusted | 30 | ||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1994-01-01 | 1994-01-01 | 9071.974026 | NaN | NaN | NaN | .. | NaN | 2784.401224 | 2578.241377 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | .. | NaN |
| 1 | 1994-02-01 | 1994-02-01 | 8736.943682 | NaN | -335.030344 | NaN | .. | NaN | 2581.373618 | 2376.681752 | ... | NaN | .. | .. | .. | .. | NaN | .. | NaN | .. | NaN |
| 2 | 1994-03-01 | 1994-03-01 | 8793.653766 | NaN | 56.710084 | NaN | .. | NaN | 3132.292345 | 2889.042887 | ... | NaN | .. | .. | .. | .. | NaN | .. | NaN | .. | NaN |
| 3 | 1994-04-01 | 1994-04-01 | 9057.280165 | NaN | 263.626399 | NaN | .. | NaN | 2851.523659 | 2630.655522 | ... | NaN | .. | .. | .. | .. | NaN | .. | NaN | .. | NaN |
| 4 | 1994-05-01 | 1994-05-01 | 8785.377858 | NaN | -271.902307 | NaN | .. | NaN | 2987.547249 | 2745.238302 | ... | NaN | .. | .. | .. | .. | NaN | .. | NaN | .. | NaN |
5 rows × 31 columns
df_DCHQ.head()
| TimePeriod | Year | Month | NA | ChqRTPVol | ChqITPVol | ChqRTPVal | ChqITPVal | Day | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2019-01-01 | 2019.0 | 1 | NaN | 7631.344 | 203.072 | 12387.266565 | 716.750040 | 01 |
| 1 | 2019-02-01 | 2019.0 | 2 | NaN | 9023.098 | 310.591 | 15339.735420 | 758.607187 | 01 |
| 2 | 2019-03-01 | 2019.0 | 3 | NaN | 13136.729 | 558.941 | 21976.744493 | 1144.727309 | 01 |
| 3 | 2019-04-01 | 2019.0 | 4 | NaN | 15228.703 | 636.232 | 25644.323777 | 1381.089363 | 01 |
| 4 | 2019-05-01 | 2019.0 | 5 | NaN | 16132.611 | 675.983 | 26757.257887 | 1442.073442 | 01 |
df_GDP_ONS.head()
| GDP:Period | GDP:QoQGrowth | GDPONS:InfAdj£ | GDPONS:Value£ | GDPONS:PerPerNonAdj | GDPONS:PerPerInfAdj | |
|---|---|---|---|---|---|---|
| 0 | 1955-01-01 | NaN | 85937 | 4749 | 93.0 | 1688.0 |
| 1 | 1955-04-01 | 0.2 | 86091 | 4773 | 94.0 | 1690.0 |
| 2 | 1955-07-01 | 2.1 | 87877 | 4941 | 97.0 | 1723.0 |
| 3 | 1955-10-01 | -0.6 | 87384 | 5027 | 98.0 | 1711.0 |
| 4 | 1956-01-01 | 0.7 | 87990 | 5124 | 100.0 | 1721.0 |
print ("Data types for BACS", "\n", df_BACS.dtypes)
print ("Data types for CHAPS", "\n", df_CHAPS.dtypes)
print ("Data types for Faster Payments", "\n", df_FPS.dtypes)
print ("Data types for Physical Cheques", "\n", df_ChQ.dtypes)
print ("Data types for Image Clearing", "\n", df_DCHQ.dtypes)
print ("Data types for Giro Credits", "\n", df_Clr.dtypes)
print ("Data types for ONS GDP", "\n", df_GDP_ONS.dtypes)
print ("Data types for Fed GDP", "\n", df_GDP_Fed.dtypes)
Data types for BACS
TimePeriod datetime64[ns]
Year float64
Month int64
VolumeBACSSO float64
VolumeBACSDC float64
VolumeBACSDD float64
NA float64
ValueBACSSODC float64
ValueBACSDD float64
Day object
dtype: object
Data types for CHAPS
TimePeriod datetime64[ns]
Year float64
Month int64
VolumeGBPCHAPS float64
NA float64
VolumeEURCHAPS object
VolumeEURCHAPSXB object
NA float64
ValueGBPCHAPS float64
NA float64
ValueEURCHAPS object
ValueEURCHAPSXB object
Day object
dtype: object
Data types for Faster Payments
TimePeriod datetime64[ns]
Year float64
Month int64
NA float64
VolumeFPSSOP float64
VolumeFPSSIP float64
VolumeFPSFDP float64
VolumeFPSReturns float64
NA float64
ValueFPSSOP float64
ValueFPSSIP float64
ValueFPSFDP float64
ValueFPSReturns float64
Day object
dtype: object
Data types for Physical Cheques
TimePeriod datetime64[ns]
Year float64
Month int64
VolChqLon float64
NA float64
VolChqSct float64
NA float64
LdnInBr float64
NA float64
SctInBr float64
float64
ValChqLon float64
NA float64
ValChqSct float64
Day object
dtype: object
Data types for Image Clearing
TimePeriod datetime64[ns]
Year float64
Month int64
NA float64
ChqRTPVol float64
ChqITPVol float64
ChqRTPVal float64
ChqITPVal float64
Day object
dtype: object
Data types for Giro Credits
TimePeriod datetime64[ns]
Year float64
Month int64
VolCreLon float64
NA float64
VolCreSco float64
NA float64
VolLdnInBr float64
NA float64
VolScoInBr float64
NA float64
ValCreLon float64
NA float64
ValCreSct float64
Day object
dtype: object
Data types for ONS GDP
GDP:Period datetime64[ns]
GDP:QoQGrowth float64
GDPONS:InfAdj£ int64
GDPONS:Value£ int64
GDPONS:PerPerNonAdj float64
GDPONS:PerPerInfAdj float64
dtype: object
Data types for Fed GDP
DATE datetime64[ns]
GDPFed:UK float64
dtype: object
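Note that the Euro CHAPS columns load as `object` because missing months are recorded as '-' in the source file. Multiplying an object column (as is done below for the Euro fields) can silently repeat strings rather than scale numbers, so coercing with `pd.to_numeric` first is safer. A sketch, not part of the original notebook:

```python
import pandas as pd

s = pd.Series(['-', 1200, '-'])            # mixed column as loaded from the spreadsheet
s_num = pd.to_numeric(s, errors='coerce')  # '-' becomes NaN, numbers become float
s_abs = s_num * 1000                       # safe to scale into absolute units now
```

The same `errors='coerce'` coercion could be applied to `VolumeEURCHAPS` and the other Euro columns before any arithmetic; here those columns are ultimately dropped, so the issue has no effect on the final dataset.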
## conversion of volume and value of transactions into absolute numbers
### converting the CHAPS Payments columns to absolute values - volumes are reported in 000s and values in £m in the source file
df_CHAPS.loc[:, 'VolumeGBPCHAPS'] *=1000
df_CHAPS.loc[:,'ValueGBPCHAPS']*=1000000
df_CHAPS.loc[:,'VolumeEURCHAPS']*=1000
df_CHAPS.loc[:,'ValueEURCHAPS']*=1000000
df_CHAPS.loc[:,'VolumeEURCHAPSXB']*=1000
df_CHAPS.loc[:,'ValueEURCHAPSXB']*=1000000
df_CHAPS.loc[:,'AvgValVolCHAPS']=df_CHAPS.ValueGBPCHAPS/df_CHAPS.VolumeGBPCHAPS
### converting the BACS Payments columns to absolute values - volumes are reported in 000s and values in £m in the source file
df_BACS.loc[:,'VolumeBACSSO']*=1000
df_BACS.loc[:,'VolumeBACSDC']*=1000
df_BACS.loc[:,'VolumeBACSDD']*=1000
df_BACS.loc[:,'ValueBACSSODC']*=1000000
df_BACS.loc[:,'ValueBACSDD']*=1000000
### converting the FPS Payments columns to absolute values - volumes are reported in 000s and values in £m in the source file
df_FPS.loc[:,'VolumeFPSSOP']*=1000
df_FPS.loc[:,'VolumeFPSSIP']*=1000
df_FPS.loc[:,'VolumeFPSFDP']*=1000
df_FPS.loc[:,'VolumeFPSReturns']*=1000
df_FPS.loc[:,'ValueFPSSOP']*=1000000
df_FPS.loc[:,'ValueFPSSIP']*=1000000
df_FPS.loc[:,'ValueFPSFDP']*=1000000
df_FPS.loc[:,'ValueFPSReturns']*=1000000
### conversion of Credit Card values to absolute numbers - values are reported in £m in the source file
df_CreditCard.loc[:,'Credit Balance']*=1000000
df_CreditCard.loc[:,'Net Changes MoM']*=1000000
df_CreditCard.loc[:,'Gross Lending Monthly']*=1000000
df_CreditCard.loc[:,'purchases Monthly/Gross Lending']*=1000000
df_CreditCard.loc[:,'of which cash withdrawals/Gross Lending']*=1000000
df_CreditCard.loc[:,'Balance Transfers']*=1000000
### conversion of Credit Card volumes to absolute numbers - volumes are reported in 000s in the source file
df_CreditCard.loc[:,'Volume of cards transactions']*=1000
df_CreditCard.loc[:,'of purchases /total txns']*=1000
df_CreditCard.loc[:,'of cash withdrawal/total txns']*=1000
df_CreditCard.loc[:,'bal tfr/total txns']*=1000
df_CreditCard.loc[:,'no. of Credit cards in use']*=1000
df_CreditCard.loc[:,'no. of accounts']*=1000
df_CreditCard.loc[:,'no. of active accounts Credit Cards']*=1000
df_CreditCard.loc[:,'purchases seasonally adjusted']*=1000
### conversion of the value and volume for Physical Cheques, Giro and Digital Cheques
df_ChQ.loc[:,'VolChqLon']*=1000
df_ChQ.loc[:,'VolChqSct']*=1000
df_ChQ.loc[:,'LdnInBr']*=1000
df_ChQ.loc[:,'SctInBr']*=1000
df_ChQ.loc[:,'ValChqLon']*=1000000
df_ChQ.loc[:,'ValChqSct']*=1000000
df_DCHQ.loc[:,'ChqRTPVol']*=1000
df_DCHQ.loc[:,'ChqITPVol']*=1000
df_DCHQ.loc[:,'ChqRTPVal']*=1000000
df_DCHQ.loc[:,'ChqITPVal']*=1000000
df_Clr.loc[:,'VolCreLon']*=1000
df_Clr.loc[:,'VolCreSco']*=1000
df_Clr.loc[:,'VolLdnInBr']*=1000
df_Clr.loc[:,'VolScoInBr']*=1000
df_Clr.loc[:,'ValCreLon']*=1000000
df_Clr.loc[:,'ValCreSct']*=1000000
### conversion of GDP data from £m into absolute £ values
df_GDP_ONS.loc[:,'GDPONS:InfAdj£']*=1000000
df_GDP_ONS.loc[:,'GDPONS:Value£']*=1000000
df_GDP_Fed.loc[:,'GDPFed:UK']*=1000000
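The block of in-place multiplications above follows a single pattern: volumes are restored from 000s and values from £m. A table-driven helper (hypothetical, shown here for two of the CHAPS columns only) would make the unit conversion less repetitive and less error-prone:

```python
import pandas as pd

# column -> multiplier to restore absolute units (assumed scales from the source files)
SCALE = {
    'VolumeGBPCHAPS': 1_000,      # volumes reported in 000s
    'ValueGBPCHAPS': 1_000_000,   # values reported in £m
}

def to_absolute(df, scale):
    """Return a copy with each listed column multiplied by its unit factor."""
    df = df.copy()
    for col, factor in scale.items():
        df[col] = df[col] * factor
    return df
```

Extending the `SCALE` dict per dataframe would then replace each group of `*=` statements with one `to_absolute` call.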
Progressive left joins are used on all the Payments dataframes to create one unified payments dataframe combining all types of Bank Account Payments (CHAPS, BACS, FPS, Cheques). The result was then joined with Credit Card Payments to create a unified Payments dataset; a right join on the Credit Card data was used because, unlike the Bank Account Payments data, it is not available prior to 1994. The unified dataframe was then condensed to a reduced number of features to facilitate the exploratory data analysis and evaluation of the modelling steps.
As Faster Payments was mainly introduced to facilitate same day transactions, with the longer term view of replacing BACS, the BACS and FPS dataframes have been combined into overall Low Value Payments volume and value. New fields were created for Transaction Value and Transaction Volume combining the three types of BACS and three types of Faster Payments transactions. Also, as Faster Payments has only been available since 2008, values prior to this date were taken as zero when combining the dataframes
Similarly, the Physical Cheques, Giro and Digital Cheques dataframes were merged to create one dataframe with the total value and volume of Cheque transactions. As the Physical Cheques and Giro data is only available till 2019 and Digital Cheque data only since 2019, an empty set of records with zero values was created for the 2019-2020 period before merging the dataframe with the Digital Cheques data
The dataframes were then incrementally combined (e.g. BACS with FPS, BACS+FPS with CHAPS, BACS+FPS+CHAPS with Cheques and Credit Card) to create a unified dataframe for all types of Payments with new aggregated Volume and Value fields to reduce the number of features across individual dataframes.
## Merging dataframes to reduce the number of dataframes
### Merging the dataframes for BACS and Faster Payment System to create one Low Value Payments dataframe and dropping the columns not required
df_BACS_FPS=pd.merge(left=df_BACS, right=df_FPS, how='left', left_on='TimePeriod', right_on='TimePeriod')
df_BACS_FPS.drop(['Year_x','Month_x','Day_x', 'Year_y','Month_y','Day_y','NA_x','NA_y'],axis=1, inplace=True)
df_BACS_FPS= df_BACS_FPS.fillna(0)
df_BACS_FPS['VolumeofallBACS']=df_BACS_FPS[['VolumeBACSSO', 'VolumeBACSDC','VolumeBACSDD']].sum(axis=1)
df_BACS_FPS['ValueofallBACS']=df_BACS_FPS[['ValueBACSSODC', 'ValueBACSDD']].sum(axis=1)
df_BACS_FPS['VolumeofallFPS']=df_BACS_FPS[['VolumeFPSSOP', 'VolumeFPSSIP','VolumeFPSFDP']].sum(axis=1)
df_BACS_FPS['ValueofallFPS']=df_BACS_FPS[['ValueFPSSOP', 'ValueFPSSIP','ValueFPSFDP']].sum(axis=1)
df_BACS_FPS['VolBACSFPS']=df_BACS_FPS[['VolumeofallBACS', 'VolumeofallFPS']].sum(axis=1)
df_BACS_FPS['ValBACSFPS']=df_BACS_FPS[['ValueofallBACS', 'ValueofallFPS']].sum(axis=1)
df_BACS_FPS.to_excel("Test_BACS_FPS.xlsx")
### Merging the BACS_FPS combined dataframe with CHAPS
df_CHAPS_BACS_FPS = pd.merge(left=df_CHAPS, right=df_BACS_FPS, how='left', left_on='TimePeriod', right_on='TimePeriod')
df_CHAPS_BACS_FPS.drop(['Year','Month','Day','VolumeEURCHAPS','VolumeEURCHAPSXB','ValueEURCHAPS', 'ValueEURCHAPSXB'],axis=1, inplace=True)
### Merging three dataframes of Physical Cheques, Giro and Digital Cheque to create one. Step includes creating empty rows in the Physical Cheques dataframe to combine the Digital Cheque through outer join
df_ChQ.drop(['Year','Month','Day','NA'],axis=1, inplace=True)
#### create zero-valued placeholder rows for Sep 2019 - Sep 2020, when physical cheque processing had ceased
placeholder_dates = pd.date_range('2019-09-01', '2020-09-01', freq='MS')
df_placeholder = pd.DataFrame(0, index=range(len(placeholder_dates)), columns=df_ChQ.columns)
df_placeholder['TimePeriod'] = placeholder_dates
#### DataFrame.append was removed in pandas 2.0; concatenate the placeholder rows instead
df_ChQ = pd.concat([df_ChQ, df_placeholder], ignore_index=True)
df_ChQ['TimePeriod'] = pd.to_datetime(df_ChQ['TimePeriod'])
df_ChQ_DCHQ = pd.merge(left=df_ChQ, right=df_DCHQ, how='left', left_on='TimePeriod', right_on='TimePeriod')
df_ChQ_Clr = pd.merge(left=df_ChQ_DCHQ, right=df_Clr, how='left', left_on='TimePeriod', right_on='TimePeriod')
df_ChQ_Clr.drop(['Year_x','Month_x','NA_x','Day_x','Year_y','Month_y','NA_y','Day_y'],axis=1, inplace=True)
df_ChQ_Clr['TotalVolChqCre'] = df_ChQ_Clr[['VolChqLon','VolChqSct','LdnInBr','SctInBr','ChqRTPVol','ChqITPVol','VolCreLon','VolCreSco','VolLdnInBr','VolScoInBr']].sum(axis=1)
df_ChQ_Clr['TotalValChqCre'] = df_ChQ_Clr[['ValChqLon','ValChqSct','ChqRTPVal','ChqITPVal','ValCreLon','ValCreSct']].sum(axis=1)
### combining BACS, FPS, CHAPS and Cheques dataframes
#df_electronic_manual = pd.merge(left=df_CHAPS_BACS_FPS, right=df_ChQ_Clr, how='left', left_on='TimePeriod', right_on='TimePeriod')
df_payments_all_interim= pd.merge(left=df_CHAPS_BACS_FPS, right=df_ChQ_Clr, how='left', left_on='TimePeriod', right_on='TimePeriod')
### combining the BACS, FPS, CHAPS and Cheques dataframe with Credit Card Payments using a left join
df_payments_all = pd.merge(left=df_payments_all_interim, right=df_CreditCard, how='left', left_on='TimePeriod', right_on='TimePeriod')
#### Right join to ensure data consistency across the time period, as the Credit Card Payments data is not available prior to 1994
df_payments_all_1994 = pd.merge(left=df_payments_all_interim, right=df_CreditCard, how='right', left_on='TimePeriod', right_on='TimePeriod')
df_payments_all.drop(['NA_x','NA_y','_x','_y','date'],axis=1, inplace=True)
df_payments_all_1994.drop(['NA_x','NA_y','_x','_y','date'],axis=1, inplace=True)
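The effect of the two join types above can be illustrated on toy data: a left join keeps every bank-payment month, while a right join on the Credit Card side restricts the result to months where card data exists (from 1994 onwards in the real dataset):

```python
import pandas as pd

bank = pd.DataFrame({'TimePeriod': pd.date_range('1990-01-01', periods=6, freq='MS'),
                     'bank_val': range(6)})
cards = pd.DataFrame({'TimePeriod': pd.date_range('1990-04-01', periods=3, freq='MS'),
                      'card_val': range(3)})

all_months = pd.merge(bank, cards, on='TimePeriod', how='left')    # all 6 bank months, NaN where no card data
from_cards = pd.merge(bank, cards, on='TimePeriod', how='right')   # only the 3 months with card data
```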
df_payments_condensed = df_payments_all_1994[['TimePeriod','VolumeGBPCHAPS','ValueGBPCHAPS','Gross Lending Monthly','Volume of cards transactions','TotalVolChqCre','TotalValChqCre','VolBACSFPS','ValBACSFPS']]
#print (df_payments_condensed.head(5))
df_payments_condensed = df_payments_condensed.rename(columns={'VolumeGBPCHAPS':'Volume High Val Monthly',
'ValueGBPCHAPS':'Value High Val Monthly',
'Gross Lending Monthly': 'Value Monthly Credit Card Spend',
'Volume of cards transactions': 'Volume Monthly Credit Card Spend',
'TotalVolChqCre':'Volume Cheques Monthly',
'TotalValChqCre':'Value Cheques Monthly',
'VolBACSFPS':'Volume Low Val Monthly',
'ValBACSFPS':'Value Low Val Monthly'}
)
df_payments_condensed_copy = df_payments_condensed.copy()
df_payments_condensed_copy.head()
| | TimePeriod | Volume High Val Monthly | Value High Val Monthly | Value Monthly Credit Card Spend | Volume Monthly Credit Card Spend | Volume Cheques Monthly | Value Cheques Monthly | Volume Low Val Monthly | Value Low Val Monthly |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1994-01-01 | 866000.0 | 2.036697e+12 | 2.784401e+09 | 5.909620e+07 | 243230000.0 | 1.039540e+11 | 156825000.0 | 7.004333e+10 |
| 1 | 1994-02-01 | 891000.0 | 2.114562e+12 | 2.581374e+09 | 5.553233e+07 | 239655000.0 | 9.827500e+10 | 152651000.0 | 7.196926e+10 |
| 2 | 1994-03-01 | 1074000.0 | 2.352965e+12 | 3.132292e+09 | 6.611874e+07 | 285575000.0 | 1.167430e+11 | 175022000.0 | 8.596932e+10 |
| 3 | 1994-04-01 | 915000.0 | 2.003176e+12 | 2.851524e+09 | 6.125302e+07 | 246024000.0 | 1.050910e+11 | 163406000.0 | 7.183867e+10 |
| 4 | 1994-05-01 | 930000.0 | 1.982787e+12 | 2.987547e+09 | 6.297435e+07 | 258920000.0 | 1.050000e+11 | 166262000.0 | 7.460768e+10 |
Timeseries, Scatterplot and Boxplots used to check Data Quality and Characteristics
A scatter plot is used as a visual analytics method for both Transaction Volume and Transaction Value, to assess data quality and to understand the relationships between the continuous variables. Initially the scatter plots showed anomalies, and on investigation issues were identified with the source data quality: the BACS source file had a number-formatting mask applied only to a certain range of values, and the Cheques data was spread across different columns over different periods. Both issues were corrected and the source files re-uploaded.
Scatterplots of the corrected source data exhibit a positive correlation between High Value, Low Value and Credit Card Payments, and a negative correlation with Cheque Payments. This validates the declining usage of cheques in the UK over time as more payments are made electronically and by card. The data is therefore considered reliable, as it reflects known trends in the payments industry.
Yearly and Monthly timeseries plots and boxplots are used to visually analyse the Transaction Volume and Value series across High Value, Low Value, Cheque and Credit Card Payments. The Yearly distributions show more outliers than the Monthly distributions.
Visually, the Yearly boxplots indicate a trend, i.e. a non-stationary timeseries whose mean and variance change over time. The Monthly boxplots show seasonality.
A summary statistical description of the Payment Transactions volume and value dataframe is of limited use for characterising the timeseries, because the mean and variance of a timeseries should be analysed over time to distinguish stationary from non-stationary behaviour.
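One way to make the stationarity observation concrete is to track rolling statistics over the series; the sketch below is a minimal, self-contained illustration on a synthetic trending series (the 12-month window and the series itself are assumptions for illustration, not the notebook's data):

```python
import numpy as np
import pandas as pd

# Synthetic monthly series with an upward trend, standing in for a payments-volume column.
idx = pd.date_range("1994-01-01", periods=120, freq="MS")
rng = np.random.default_rng(0)
series = pd.Series(np.linspace(100, 400, 120) + rng.normal(0, 5, 120), index=idx)

# 12-month rolling mean and std: a drifting rolling mean signals a non-stationary series.
rolling_mean = series.rolling(window=12).mean()
rolling_std = series.rolling(window=12).std()

drift = rolling_mean.dropna().iloc[-1] - rolling_mean.dropna().iloc[0]
print(f"Rolling-mean drift over the sample: {drift:.1f}")  # large drift => non-stationary
```

A roughly constant rolling mean and std would instead support treating the series as stationary.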
### Visual Analysis for checking the data quality and relationships for Transaction Volume
import seaborn as sns
df_payments_cod_vol = df_payments_condensed[['TimePeriod','Volume High Val Monthly','Volume Monthly Credit Card Spend','Volume Cheques Monthly','Volume Low Val Monthly']].copy().set_index(('TimePeriod'))
sns.pairplot(df_payments_cod_vol).fig.suptitle("Scatter plot of Payment Types - Volume")
df_payments_cod_vol_copy = df_payments_cod_vol.copy()
### Visual data analysis for Transaction Values
df_payments_cod_val=df_payments_condensed[['TimePeriod','Value High Val Monthly','Value Monthly Credit Card Spend','Value Cheques Monthly','Value Low Val Monthly']].copy().set_index(('TimePeriod'))
df_payments_cod_val_copy = df_payments_cod_val.copy()
sns.pairplot(df_payments_cod_val).fig.suptitle("Scatter plot of Payment Types - Value")
### Descriptive Statistical Analysis
df_payments_condensed.describe()
| | Volume High Val Monthly | Value High Val Monthly | Value Monthly Credit Card Spend | Volume Monthly Credit Card Spend | Volume Cheques Monthly | Value Cheques Monthly | Volume Low Val Monthly | Value Low Val Monthly |
|---|---|---|---|---|---|---|---|---|
| count | 3.210000e+02 | 3.210000e+02 | 3.210000e+02 | 3.210000e+02 | 3.210000e+02 | 3.210000e+02 | 3.210000e+02 | 3.210000e+02 |
| mean | 2.505626e+06 | 4.921057e+12 | 1.093661e+10 | 1.685858e+08 | 1.356351e+08 | 8.140601e+10 | 4.424535e+08 | 3.075738e+11 |
| std | 8.567268e+05 | 1.500706e+12 | 3.860261e+09 | 6.002214e+07 | 8.222842e+07 | 3.502019e+10 | 1.752917e+08 | 1.566354e+11 |
| min | 8.660000e+05 | 1.953703e+12 | 2.581374e+09 | 5.553233e+07 | 1.147111e+07 | 1.481467e+10 | 1.526510e+08 | 7.004333e+10 |
| 25% | 1.862000e+06 | 4.036566e+12 | 8.282352e+09 | 1.305179e+08 | 5.568333e+07 | 4.431517e+10 | 2.796888e+08 | 1.645124e+11 |
| 50% | 2.648000e+06 | 4.956959e+12 | 1.225015e+10 | 1.649556e+08 | 1.340650e+08 | 9.948900e+10 | 4.597628e+08 | 3.019213e+11 |
| 75% | 3.018379e+06 | 5.970535e+12 | 1.333467e+10 | 2.025639e+08 | 2.136220e+08 | 1.101510e+11 | 5.687625e+08 | 4.286210e+11 |
| max | 4.420839e+06 | 8.658312e+12 | 1.857551e+10 | 3.432284e+08 | 2.855750e+08 | 1.329200e+11 | 8.179024e+08 | 6.403879e+11 |
### Timeseries plot of Payments Datasets
import matplotlib.pyplot as plt
import seaborn as sns
#sns.scatterplot(df_CHAPS['VolumeofGBPPayments'],df_CHAPS['ValueofGBPPayments'])
# The top plot consisting of High Value Payment amount
top = plt.subplot2grid((10, 10), (0,0), colspan=4)
top.plot(df_CHAPS_BACS_FPS['TimePeriod'], df_CHAPS_BACS_FPS['ValueGBPCHAPS'], label='CHAPS Value')
plt.title('CHAPS Jan 1990-Sep 2020 Payments Value')
plt.legend(loc=2)
# The bottom plot consisting of High Value Payment Volume
bottom = plt.subplot2grid((10, 10), (1,0), colspan=4)
bottom.bar(df_CHAPS_BACS_FPS['TimePeriod'], df_CHAPS_BACS_FPS['VolumeGBPCHAPS'])
plt.title('CHAPS Jan 1990- Sep 2020 Payment Volume')
#plt.gcf().set_size_inches(12, 8)
#plt.subplots_adjust(hspace=0.75)
# The top plot consisting of BACS Value Payment amount
top1 = plt.subplot2grid((10,10), (2,0), colspan=4)
top1.plot(df_CHAPS_BACS_FPS['TimePeriod'],df_CHAPS_BACS_FPS['ValueofallBACS'],label='BACS value')
plt.title('BACS Jan 1990- Sep 2020 Payments Value')
plt.legend(loc=2)
# The bottom plot consisting of BACS Payment volumne
bottom1 = plt.subplot2grid((10,10), (3,0), colspan=4)
bottom1.bar(df_CHAPS_BACS_FPS['TimePeriod'], df_CHAPS_BACS_FPS['VolumeofallBACS'])
plt.title('BACS Jan 1990- Sep 2020 Payments Volume')
# The top plot consisting of Faster Payment value
top1 = plt.subplot2grid((10,10), (4,0), colspan=4)
top1.plot(df_CHAPS_BACS_FPS['TimePeriod'],df_CHAPS_BACS_FPS['ValueofallFPS'],label='FPS value')
plt.title('Faster Payments May 2008-Sep 2020 Value')
plt.legend(loc=2)
# The bottom plot consisting of Faster Payment volume
bottom1 = plt.subplot2grid((10,10), (5,0), colspan=4)
bottom1.bar(df_CHAPS_BACS_FPS['TimePeriod'], df_CHAPS_BACS_FPS['VolumeofallFPS'])
plt.title('Faster Payments May 2008-Sep 2020 Volume')
plt.gcf().set_size_inches(48, 32)
plt.subplots_adjust(hspace=0.75)
# The top plot consisting of Credit Card value
top1 = plt.subplot2grid((10,10), (6,0), colspan=4)
top1.plot(df_CreditCard['TimePeriod'],df_CreditCard['Gross Lending Monthly'],label='Credit Card Payments')
plt.title('Credit Card 1994-2020 Payment Value ')
plt.legend(loc=2)
# The bottom plot consisting of Credit cards Payment volume
bottom1 = plt.subplot2grid((10,10), (7,0), colspan=4)
bottom1.bar(df_CreditCard['TimePeriod'], df_CreditCard['Volume of cards transactions'])
plt.title('Credit Card 1994-2020 Transaction volume')
plt.gcf().set_size_inches(48, 32)
plt.subplots_adjust(hspace=0.75)
# The top plot consisting of Cheques value
top1 = plt.subplot2grid((10,10), (8,0), colspan=4)
top1.plot(df_ChQ_Clr['TimePeriod'],df_ChQ_Clr['TotalValChqCre'],label='Cheque Payments')
plt.title('Cheques 1994-2020 Payment Value ')
plt.legend(loc=2)
# The bottom plot consisting of Cheques Payment volume
bottom1 = plt.subplot2grid((10,10), (9,0), colspan=4)
bottom1.bar(df_ChQ_Clr['TimePeriod'], df_ChQ_Clr['TotalVolChqCre'])
plt.title('Cheques 1994-2020 Transaction volume')
plt.gcf().set_size_inches(48, 32)
plt.subplots_adjust(hspace=0.75)
### Descriptive Statistical Analysis of Payments Transaction Volume through a Boxplot
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
#print (df_payments_condensed)
#fig, ax = plt.subplots(figsize=(12,5))
#seaborn.boxplot(df_payments_condensed.index.monthofyear, df_payments_condensed, ax=ax)
df_payments_condensed['year'] = df_payments_condensed['TimePeriod'].dt.year
df_payments_condensed['month'] = df_payments_condensed['TimePeriod'].dt.month
#print (df_payments_condensed.columns)
plt.figure(figsize=(10,5))
sns.boxplot(x="year", y="Volume High Val Monthly", data=df_payments_condensed, width=0.5)
plt.title('Volume of High Value Payments')
plt.show()
plt.figure(figsize=(10,5))
sns.boxplot(x="year", y="Volume Low Val Monthly", data=df_payments_condensed, width=0.5)
plt.title('Volume of Low Value Payments')
plt.show()
plt.figure(figsize=(10,5))
sns.boxplot(x="year", y="Volume Monthly Credit Card Spend", data=df_payments_condensed, width=0.5)
plt.title('Volume of Credit Card Payments')
plt.show()
plt.figure(figsize=(10,5))
sns.boxplot(x="year", y="Volume Cheques Monthly", data=df_payments_condensed, width=0.8)
plt.title('Volume of Cheque')
plt.show()
### Monthly Boxplots of Payments Transaction Volume to check seasonality
plt.figure(figsize=(10,5))
sns.boxplot(x="month", y="Volume High Val Monthly", data=df_payments_condensed, width=0.5)
plt.title('Volume of High Value Payments')
plt.show()
plt.figure(figsize=(10,5))
sns.boxplot(x="month", y="Volume Low Val Monthly", data=df_payments_condensed, width=0.5)
plt.title('Volume of Low Value Payments')
plt.show()
plt.figure(figsize=(10,5))
sns.boxplot(x="month", y="Volume Monthly Credit Card Spend", data=df_payments_condensed, width=0.5)
plt.title('Volume of Credit Card Payments')
plt.show()
plt.figure(figsize=(10,5))
sns.boxplot(x="month", y="Volume Cheques Monthly", data=df_payments_condensed, width=0.5)
plt.title('Volume of Cheque')
plt.show()
### Yearly Boxplot of Payment Values to analyse the trends
df_payments_cod_val['year']=df_payments_cod_val.index.year
df_payments_cod_val['month']=df_payments_cod_val.index.month
plt.figure(figsize=(10,5))
sns.boxplot(x="year", y="Value High Val Monthly", data=df_payments_cod_val, width=0.5)
plt.title('Value of High Value Payments')
plt.show()
plt.figure(figsize=(10,5))
sns.boxplot(x="year", y="Value Low Val Monthly", data=df_payments_cod_val, width=0.5)
plt.title('Value of Low Value Payments')
plt.show()
plt.figure(figsize=(10,5))
sns.boxplot(x="year", y="Value Monthly Credit Card Spend", data=df_payments_cod_val, width=0.5)
plt.title('Value of Credit Card Payments')
plt.show()
plt.figure(figsize=(10,5))
sns.boxplot(x="year", y="Value Cheques Monthly", data=df_payments_cod_val, width=0.8)
plt.title('Value of Cheque')
plt.show()
### Monthly Boxplots of Payments Transactions Value
plt.figure(figsize=(10,5))
sns.boxplot(x="month", y="Value High Val Monthly", data=df_payments_cod_val, width=0.5)
plt.title('Value of High Value Payments')
plt.show()
plt.figure(figsize=(10,5))
sns.boxplot(x="month", y="Value Low Val Monthly", data=df_payments_cod_val, width=0.5)
plt.title('Value of Low Value Payments')
plt.show()
plt.figure(figsize=(10,5))
sns.boxplot(x="month", y="Value Monthly Credit Card Spend", data=df_payments_cod_val, width=0.5)
plt.title('Value of Credit Card Payments')
plt.show()
plt.figure(figsize=(10,5))
sns.boxplot(x="month", y="Value Cheques Monthly", data=df_payments_cod_val, width=0.8)
plt.title('Value of Cheque')
plt.show()
### Boxplots of Predictor Variables grouped by TimePeriod
df_payment_vol_DStat=df_payments_cod_vol.copy()
df_payment_val_DStat=df_payments_cod_val.copy()
df_payment_val_DStat.drop(['month','year'],axis=1, inplace=True)
# TimePeriod is already the index, so resample it directly to fiscal quarters;
# the extra groupby('TimePeriod') step left each timestamp in its own group and never aggregated across quarters
BP_Vol = df_payment_vol_DStat.resample('Q-MAR').sum()
BP_Val = df_payment_val_DStat.resample('Q-MAR').sum()
#print ("Boxplots of Different Payment Types by Volume and Quarters","\n")
BP_Vol_T = BP_Vol.transform(lambda x: np.log(x))
plt.figure(figsize=(15,10))
BP_Vol_T.boxplot(notch=True, vert=True)
plt.title ('BoxPlot of Volume distributions for High Value, Low Value, Cheque and Credit Card Payments ')
plt.show()
print ("Boxplots of Different Payment Types by Value and Quarters","\n")
BP_Val_T = BP_Val.transform(lambda x: np.log(x))
plt.figure(figsize=(15,10))
plt.title ('BoxPlot of Value distributions for High Value, Low Value, Cheque and Credit Card Payments ')
BP_Val_T.boxplot(notch=True, vert=True)
plt.show()
### Descriptive Statistics and measure of Skewness/ Kurtosis of Predictor Variables
from scipy.stats import kurtosis
from scipy.stats import skew
print ("skewness of Volume of Payment Types")
print (BP_Vol_T.skew())
print ("\n")
print ("kurtosis of Volume of Payment Types")
print (BP_Vol_T.kurtosis())
print ("\n")
print ("skewness of Value of Payment Types")
print (BP_Val_T.skew())
print ("\n")
print ("kurtosis of Value of Payment Types")
print (BP_Val_T.kurtosis())
skewness of Volume of Payment Types
Volume High Val Monthly            -0.764517
Volume Monthly Credit Card Spend   -0.527185
Volume Cheques Monthly             -0.655650
Volume Low Val Monthly             -0.425063
dtype: float64

kurtosis of Volume of Payment Types
Volume High Val Monthly            -0.276386
Volume Monthly Credit Card Spend   -0.103600
Volume Cheques Monthly             -0.738224
Volume Low Val Monthly             -0.990872
dtype: float64

skewness of Value of Payment Types
Value High Val Monthly             -0.855023
Value Monthly Credit Card Spend    -1.250443
Value Cheques Monthly              -0.788848
Value Low Val Monthly              -0.486033
dtype: float64

kurtosis of Value of Payment Types
Value High Val Monthly              0.051928
Value Monthly Credit Card Spend     0.495001
Value Cheques Monthly              -0.642505
Value Low Val Monthly              -0.879604
dtype: float64
A single dataset was created to combine the GDP information available from two different sources. The quarterly series from these two sources have two different date formats (YYYY-Qq vs DD/MM/YYYY). The date format has been standardised and also converted to end of period/quarter for consistency
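The conversion of the two quarterly formats to a common quarter-end key can be sketched as follows; the sample values are illustrative only, not the actual ONS/Fed source rows:

```python
import pandas as pd

# ONS-style quarters ("YYYY-Qq") and Fed-style day-first dates ("DD/MM/YYYY"),
# both mapped to the same quarter-end timestamp.
ons_raw = pd.Series(["1994-Q1", "1994-Q2"])
fed_raw = pd.Series(["01/01/1994", "01/04/1994"])

# Parse quarter strings directly as periods, then take the quarter-end timestamp.
ons_qend = pd.PeriodIndex(ons_raw, freq="Q").to_timestamp(how="end").normalize()

# Parse day-first dates, bucket into quarters, then take the quarter-end timestamp.
fed_qend = (pd.to_datetime(fed_raw, dayfirst=True)
              .dt.to_period("Q").dt.to_timestamp(how="end").dt.normalize())

assert list(ons_qend) == list(fed_qend)  # both series now share quarter-end merge keys
```

With a shared quarter-end key, the two GDP frames can be merged directly as in the cell below.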
#print (df_GDP_Fed)
#print (df_GDP_ONS)
##merging the two GDP dataframes
#print (df_GDP_Fed)
#print (df_GDP_ONS)
df_GDP_ONS['TimePeriod1'] = df_GDP_ONS['GDP:Period'].dt.to_period("Q").dt.end_time
df_GDP_ONS['TimePeriod']=df_GDP_ONS['TimePeriod1'].dt.strftime('%Y-%m-%d')
df_GDP_ONS['TimePeriod']=pd.to_datetime(df_GDP_ONS['TimePeriod'])
df_GDP_Fed['Qdate'] = [date - pd.tseries.offsets.DateOffset(days=1) + pd.tseries.offsets.QuarterEnd() for date in df_GDP_Fed.DATE]
df_GDP_Combined = pd.merge(left=df_GDP_Fed, right=df_GDP_ONS, how='left', left_on='Qdate', right_on='TimePeriod')
df_GDP_Combined.drop(['GDP:Period','GDP:QoQGrowth','GDPONS:PerPerNonAdj','GDPONS:PerPerInfAdj'],axis=1,inplace=True)
#print (df_GDP_Combined)
The timeseries plot is analysed alongside Year and Quarter boxplots for the Nominal and Inflation-Adjusted GDP. The timeseries plot indicates that the mean is trending up, while the boxplots show a right-skewed distribution for the Inflation-Adjusted GDP from the ONS and Fed sources; the Nominal GDP appears closer to a normal distribution.
### Timeseries plot for GDP data sourced from different sites
plt.plot(df_GDP_Combined['Qdate'],df_GDP_Combined['GDPONS:InfAdj£'],label="ONS Inflation Adjusted GDP, £mm")
plt.plot(df_GDP_Combined['Qdate'],df_GDP_Combined['GDPONS:Value£'], label="ONS Nominal GDP, £mm")
plt.plot(df_GDP_Combined['Qdate'],df_GDP_Combined['GDPFed:UK'], label ="Fed UK GDP, £mm")
plt.legend(loc="upper left")
plt.xlabel('TimePeriod')
plt.ylabel('GDP')
plt.show()
df_GDP_Combined['year'] = df_GDP_Combined['Qdate'].dt.year
df_GDP_Combined['month'] = df_GDP_Combined['Qdate'].dt.month
mask_1994 = (df_GDP_Combined['Qdate'] > '1994-1-1')  # avoid shadowing the built-in range()
df_GDP_Combined_1994onwards = df_GDP_Combined.loc[mask_1994]
plt.figure(figsize=(10,5))
sns.boxplot(x="year", y="GDPONS:InfAdj£", data=df_GDP_Combined_1994onwards, width=0.5)
plt.title('GDP Inflation Adjusted , ONS')
plt.show()
plt.figure(figsize=(10,5))
sns.boxplot(x="year", y="GDPONS:Value£", data=df_GDP_Combined_1994onwards, width=0.5)
plt.title('GDP Nominal , ONS')
plt.show()
plt.figure(figsize=(10,5))
sns.boxplot(x="year", y="GDPFed:UK", data=df_GDP_Combined_1994onwards, width=0.5)
plt.title('GDP Inflation Adjusted ,Fed ')
plt.show()
#print (df_GDP_Combined_1994onwards)
plt.figure(figsize=(10,5))
sns.boxplot(x="month", y="GDPONS:InfAdj£", data=df_GDP_Combined_1994onwards, width=0.5)
plt.title('GDP Inflation Adjusted , ONS')
plt.show()
plt.figure(figsize=(10,5))
sns.boxplot(x="month", y="GDPONS:Value£", data=df_GDP_Combined_1994onwards, width=0.5, notch=True)
plt.title('GDP Nominal , ONS')
plt.show()
plt.figure(figsize=(10,5))
sns.boxplot(x="month", y="GDPFed:UK", data=df_GDP_Combined_1994onwards, width=0.5)
plt.title('GDP Inflation Adjusted ,Fed ')
plt.show()
### Descriptive Statistics of GDP
GDP_DStat = df_GDP_Combined_1994onwards.copy().set_index('Qdate')
#print (GDP_DStat)
GDP_DStat.drop(['DATE','TimePeriod1','TimePeriod','year','month'],axis=1,inplace=True)
print ("skewness of GDP parameters","\n")
print (GDP_DStat.skew())
print ("\n")
print ("kurtosis of GDP parameters","\n")
print (GDP_DStat.kurtosis())
skewness of GDP parameters

GDPFed:UK        -0.390448
GDPONS:InfAdj£   -0.591080
GDPONS:Value£    -0.087247
dtype: float64

kurtosis of GDP parameters

GDPFed:UK        -1.044042
GDPONS:InfAdj£   -0.966489
GDPONS:Value£    -1.386822
dtype: float64
Correlation analysis is performed on the unified dataset created by merging the Target and Predictor variable datasets. A very high degree of positive correlation is observed among the electronic payment methods (High Value, Low Value and Card Payments), and a negative correlation is observed between Cheque Payments and the electronic payment methods. The Variance Inflation Factor is used to identify multicollinearity issues for Multiple Linear Regression modelling.
New features (Average Transaction Value = Value of Payments / Volume of Payments) are created as predictor variables to combine the impact of transactionality in the economy with the magnitude of the transactions.
PairGrid plots are used on these new feature variables and the two types of target variables (Nominal GDP and Inflation-Adjusted GDP) to visually assess the distribution of the individual parameters and the correlation of the feature variables among themselves as well as with the target variable. While a clear relationship can be identified between the target variable and some of the feature variables (Average Credit Card Transaction Amount to GDP, Average Low Value Transaction Amount to GDP), a non-monotonic relationship is also highlighted between GDP and Average High Value Transaction Amount.
Regression and residual plots are used to identify the best feature variable for the regression model. Patterns are observed in the residual plots of feature variable against target variable; these patterns are less pronounced for GDP vs Average High Value Transaction Amount than for GDP vs Average Credit Card and GDP vs Average Low Value Transaction Amount.
The visual methods are applied to both the Nominal GDP and Inflation-Adjusted GDP targets to assess major differences.
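The VIF computation used later in this notebook takes the diagonal of the inverse correlation matrix; a self-contained sketch on synthetic predictors (the variable names are illustrative) shows why near-collinear variables produce large VIFs:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
x1 = rng.normal(size=200)
predictors = pd.DataFrame({
    "x1": x1,
    "x2": 0.95 * x1 + rng.normal(scale=0.1, size=200),  # near-collinear with x1
    "x3": rng.normal(size=200),                         # independent of the others
})

# VIF_i = 1 / (1 - R_i^2), obtainable as the diagonal of the inverse correlation matrix.
vifs = pd.Series(np.linalg.inv(predictors.corr().to_numpy()).diagonal(),
                 index=predictors.columns, name="VIF")
print(vifs)  # x1 and x2 show large VIFs; x3 stays close to 1
```

A common rule of thumb treats VIF values above 5-10 as a sign of problematic multicollinearity.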
### Feature selection
df_payments_cod_val_Q = df_payments_cod_val_copy.copy()
df_payments_cod_vol_Q = df_payments_cod_vol_copy.copy()
df_payments_cod_val_Q=df_payments_cod_val_Q.resample('Q').sum()
df_payments_cod_vol_Q=df_payments_cod_vol_Q.resample('Q').sum()
merged_payments_Q = df_payments_cod_vol_Q.merge(df_payments_cod_val_Q, left_index=True, right_index=True)
merged_payments_Q_copy = merged_payments_Q.copy()
print (merged_payments_Q.columns)
plt.figure(figsize=(15,10))
sns.heatmap(merged_payments_Q.corr(),annot=True)
plt.show()
Index(['Volume High Val Monthly', 'Volume Monthly Credit Card Spend',
'Volume Cheques Monthly', 'Volume Low Val Monthly',
'Value High Val Monthly', 'Value Monthly Credit Card Spend',
'Value Cheques Monthly', 'Value Low Val Monthly'],
dtype='object')
df_payments_cod_val_Q.head()
| | Value High Val Monthly | Value Monthly Credit Card Spend | Value Cheques Monthly | Value Low Val Monthly |
|---|---|---|---|---|
| TimePeriod | ||||
| 1994-03-31 | 6.504224e+12 | 8.498067e+09 | 3.189720e+11 | 2.279819e+11 |
| 1994-06-30 | 6.073072e+12 | 9.100763e+09 | 3.188590e+11 | 2.253897e+11 |
| 1994-09-30 | 6.157786e+12 | 9.643722e+09 | 3.192700e+11 | 2.400360e+11 |
| 1994-12-31 | 6.317689e+12 | 9.889370e+09 | 3.227930e+11 | 2.474340e+11 |
| 1995-03-31 | 7.044896e+12 | 9.631356e+09 | 3.171290e+11 | 2.482133e+11 |
merged_payments_Q.head()
| | Volume High Val Monthly | Volume Monthly Credit Card Spend | Volume Cheques Monthly | Volume Low Val Monthly | Value High Val Monthly | Value Monthly Credit Card Spend | Value Cheques Monthly | Value Low Val Monthly |
|---|---|---|---|---|---|---|---|---|
| TimePeriod | ||||||||
| 1994-03-31 | 2831000.0 | 1.807473e+08 | 768460000.0 | 484498000.0 | 6.504224e+12 | 8.498067e+09 | 3.189720e+11 | 2.279819e+11 |
| 1994-06-30 | 2851000.0 | 1.932671e+08 | 775734000.0 | 508503000.0 | 6.073072e+12 | 9.100763e+09 | 3.188590e+11 | 2.253897e+11 |
| 1994-09-30 | 2992000.0 | 2.023186e+08 | 777188000.0 | 529650000.0 | 6.157786e+12 | 9.643722e+09 | 3.192700e+11 | 2.400360e+11 |
| 1994-12-31 | 2978000.0 | 2.172931e+08 | 800344000.0 | 535093000.0 | 6.317689e+12 | 9.889370e+09 | 3.227930e+11 | 2.474340e+11 |
| 1995-03-31 | 3066000.0 | 1.993873e+08 | 743806000.0 | 534649980.0 | 7.044896e+12 | 9.631356e+09 | 3.171290e+11 | 2.482133e+11 |
### Feature Engineering - new field added for average value of transactions
df_GDP_Combined_1994onwards_mod= df_GDP_Combined_1994onwards.copy().set_index(('TimePeriod'))
df_payments_condensed_mod = df_payments_condensed.copy().set_index(('TimePeriod'))
df_payments_condensed_mod= df_payments_condensed_mod.resample('Q', convention='end').agg('sum')
merged_df_modelling = df_GDP_Combined_1994onwards_mod.merge(df_payments_condensed_mod, left_index=True, right_index=True)
merged_df_modelling.drop(['DATE','year_x','month_x','Qdate','year_y','month_y','TimePeriod1'], axis=1, inplace=True)
merged_df_modelling['AvgLowValTxn']=merged_df_modelling['Value Low Val Monthly']/merged_df_modelling['Volume Low Val Monthly']
merged_df_modelling['AvgCreditCardTxn']=merged_df_modelling['Value Monthly Credit Card Spend']/merged_df_modelling['Volume Monthly Credit Card Spend']
merged_df_modelling['AvgHighValTxn']=merged_df_modelling['Value High Val Monthly']/merged_df_modelling['Volume High Val Monthly']
### Linear Regression using single predictor - average High Value Transaction
### Visual Analysis of the variance and distribution of original and synthetic predictor variables
feature_test = merged_df_modelling.copy()
feature_test.drop(['GDPFed:UK','GDPONS:InfAdj£','Volume High Val Monthly', 'Value High Val Monthly',
'Value Monthly Credit Card Spend','Volume Monthly Credit Card Spend',
'Volume Cheques Monthly','Value Cheques Monthly',
'Volume Low Val Monthly','Value Low Val Monthly'],
axis=1,inplace=True)
feature_test.head()
| | GDPONS:Value£ | AvgLowValTxn | AvgCreditCardTxn | AvgHighValTxn |
|---|---|---|---|---|
| TimePeriod | ||||
| 1994-03-31 | 1.715770e+11 | 470.552851 | 47.016298 | 2.297501e+06 |
| 1994-06-30 | 1.733510e+11 | 443.241576 | 47.089058 | 2.130155e+06 |
| 1994-09-30 | 1.757000e+11 | 453.197340 | 47.666008 | 2.058084e+06 |
| 1994-12-31 | 1.799410e+11 | 462.413078 | 45.511664 | 2.121454e+06 |
| 1995-03-31 | 1.819560e+11 | 464.253851 | 48.304753 | 2.297748e+06 |
# PairGrid creates its own figure, so a separate plt.figure() call only leaves an empty figure behind
g_features = sns.PairGrid(feature_test, diag_sharey=False, corner=True)
g_features.map_lower(sns.scatterplot)
g_features.map_diag(sns.kdeplot)
g_features.fig.suptitle("Pairplot of Predictor Variables with Target Variable")
plt.show()
plt.figure(figsize=(15,10))
sns.heatmap(feature_test.corr(),annot=True)
plt.title("Correlation Plot of Predictor Variables with Target Variable")
plt.show()
### Variance Inflation Factor
merged_df_modelling_feature_test = merged_df_modelling.copy()
merged_df_modelling_feature_test.drop(['GDPFed:UK','GDPONS:InfAdj£','GDPONS:Value£'],axis=1,inplace=True)
vifs = pd.Series(np.linalg.inv(merged_payments_Q.corr().to_numpy()).diagonal(),
index=merged_payments_Q.columns,
name='VIF')
print (vifs)
#print (merged_df_modelling)
#print(merged_df_modelling_feature_test)
vifs_mod = pd.Series(np.linalg.inv(feature_test.corr().to_numpy()).diagonal(),
index=feature_test.columns,
name='VIF')
print ("Variable Inflation Factor of the Predictor Variables " , "\n", "\n", vifs_mod)
Volume High Val Monthly              87.940564
Volume Monthly Credit Card Spend     63.172924
Volume Cheques Monthly              176.451349
Volume Low Val Monthly              133.331131
Value High Val Monthly               22.098095
Value Monthly Credit Card Spend      51.658481
Value Cheques Monthly                78.793053
Value Low Val Monthly               398.961054
Name: VIF, dtype: float64

Variable Inflation Factor of the Predictor Variables

GDPONS:Value£       15.381788
AvgLowValTxn        15.390285
AvgCreditCardTxn     3.354909
AvgHighValTxn        1.940842
Name: VIF, dtype: float64
### Visual regression plots to select the predictor variable, with Nominal GDP as Target Variable
sns.regplot(x="AvgHighValTxn", y="GDPONS:Value£", data=feature_test)
plt.title("Regplot High Value Payments vs GDP")
plt.show()
sns.regplot(x="AvgCreditCardTxn", y="GDPONS:Value£", data=feature_test)
plt.title("Regplot Credit Card Payments vs GDP")
plt.show()
sns.regplot(x="AvgLowValTxn", y="GDPONS:Value£", data=feature_test)
plt.title("Regplot Low Value Payments vs GDP")
plt.show()
### Residual plots with Nominal GDP as the Target Variable to check if the Simple Regression Model is fit for purpose
sns.residplot(x="AvgHighValTxn", y="GDPONS:Value£", data=feature_test,scatter_kws={"s": 90})
plt.title("Residual plot of High Value Payments vs GDP")
plt.show()
sns.residplot(x="AvgCreditCardTxn", y="GDPONS:Value£", data=feature_test, scatter_kws={"s": 80})
plt.title("Residual plot of Credit Card Payments vs GDP")
plt.show()
sns.residplot(x="AvgLowValTxn", y="GDPONS:Value£", data=feature_test, scatter_kws={"s": 80})
plt.title("Residual plot of Low Value Payments vs GDP")
plt.show()
### Feature Test Two using the Inflation Adjusted GDP as the Target Variable
feature_test_two = merged_df_modelling.copy()
feature_test_two.drop(['GDPFed:UK','GDPONS:Value£'],axis=1,inplace=True)
sns.regplot(x="AvgHighValTxn", y="GDPONS:InfAdj£", data=feature_test_two)
plt.title("Regplot High Value Payments vs Inflation Adjusted GDP")
plt.show()
sns.regplot(x="AvgCreditCardTxn", y="GDPONS:InfAdj£", data=feature_test_two)
plt.title("Regplot Credit Card Payments vs Inflation Adjusted GDP")
plt.show()
sns.regplot(x="AvgLowValTxn", y="GDPONS:InfAdj£", data=feature_test_two)
plt.title("Regplot Low Value Payments vs Inflation Adjusted GDP")
plt.show()
### Residual plots for Inflation Adjusted GDP as Target Variable
sns.residplot(x="AvgHighValTxn", y="GDPONS:InfAdj£", data=feature_test_two)
plt.title("Residual plot High Value Payments vs Inflation Adjusted GDP")
plt.show()
sns.residplot(x="AvgCreditCardTxn", y="GDPONS:InfAdj£", data=feature_test_two)
plt.title("Residual plot Credit Card Payments vs Inflation Adjusted GDP")
plt.show()
sns.residplot(x="AvgLowValTxn", y="GDPONS:InfAdj£", data=feature_test_two)
plt.title("Residual plot Low Value Payments vs Inflation Adjusted GDP")
plt.show()
The Ordinary Least Squares (OLS) linear regression method from statsmodels is used for modelling GDP with Low Value Payments as the predictor variable. The resulting Simple Linear Regression model is compared against models using the Average Payment Amount feature of High Value and Credit Card Payments.
Each model run has a model summary and residual plot to review the outcome and evaluate model characteristics.
Model evaluation is separately done using the Harvey-Collier test and a histogram plot of the residuals.
### Simple Regression Model using one Predictor Variable - Average Low Value Transaction
from sklearn.model_selection import train_test_split
import statsmodels.api as sm
import scipy.stats as stats
X_train, X_test, y_train, y_test = train_test_split(feature_test['AvgLowValTxn'], feature_test['GDPONS:Value£'], test_size=0.33, random_state=22)
#print (y_test.values)
X = sm.add_constant(X_train)
modelLV = sm.OLS(y_train,X)
resultsLV = modelLV.fit()
X1 = sm.add_constant(X_test)
y_pred = resultsLV.predict(X1)
y1_pred = y_pred.values
y1_test = y_test.values
y_test = y_test.to_frame()
y_pred = y_pred.to_frame()
#y_testLV = y_testLV.to_frame()
#y_predLV = y_predLV.to_frame()
#print (X1.shape)
#print (y_test.shape)
#print (y_pred.shape)
df_model_LV_temp1 = pd.merge(X1, y_test, left_index=True, right_index=True)
df_model_LV_temp2 = pd.merge(X1, y_pred, left_index=True, right_index=True)
df_model_LV_temp2=df_model_LV_temp2.rename(columns={0:"GDP:PredictedLV"})
df_model_LV = pd.merge(df_model_LV_temp1, df_model_LV_temp2, left_index=True, right_index=True)
df_model_LV['ResidualLV'] = df_model_LV['GDPONS:Value£']-df_model_LV['GDP:PredictedLV']
#sns.scatterplot(x='AvgLowValTxn_x',y='GDPONS:Value£', data = df_model_LV)
#plt.plot(df_model_LV['AvgLowValTxn_x'],df_model_LV['GDP:PredictedLV'])
#plt.show()
#sns.scatterplot(x='GDPONS:Value£',y='ResidualLV', data = df_model_LV)
res_modelLV = resultsLV.resid
fig_model2_LV =sm.qqplot(res_modelLV, stats.t, fit = True, line ="45")
plt.title("QQ Residual plot of predicted GDP based on Low Value Payments ")
plt.show()
print(resultsLV.summary())
OLS Regression Results
==============================================================================
Dep. Variable: GDPONS:Value£ R-squared: 0.917
Model: OLS Adj. R-squared: 0.916
Method: Least Squares F-statistic: 554.3
Date: Sun, 20 Dec 2020 Prob (F-statistic): 1.03e-28
Time: 19:01:08 Log-Likelihood: -1308.1
No. Observations: 52 AIC: 2620.
Df Residuals: 50 BIC: 2624.
Df Model: 1
Covariance Type: nonrobust
================================================================================
coef std err t P>|t| [0.025 0.975]
--------------------------------------------------------------------------------
const -2.139e+11 2.16e+10 -9.922 0.000 -2.57e+11 -1.71e+11
AvgLowValTxn 8.211e+08 3.49e+07 23.544 0.000 7.51e+08 8.91e+08
==============================================================================
Omnibus: 4.952 Durbin-Watson: 1.790
Prob(Omnibus): 0.084 Jarque-Bera (JB): 2.200
Skew: -0.159 Prob(JB): 0.333
Kurtosis: 2.044 Cond. No. 4.63e+03
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 4.63e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
df_model_LV[['ResidualLV','GDP:PredictedLV']]
#df_model_LV.to_excel('res results Excel.xlsx')
| | ResidualLV | GDP:PredictedLV |
|---|---|---|
| TimePeriod | ||
| 2008-09-30 | 1.432876e+10 | 3.491582e+11 |
| 2000-09-30 | -5.130613e+09 | 2.528736e+11 |
| 2007-09-30 | 2.640858e+10 | 3.346814e+11 |
| 1999-09-30 | -1.827265e+10 | 2.543756e+11 |
| 1998-03-31 | -3.472658e+10 | 2.538846e+11 |
| 2009-09-30 | 2.796017e+10 | 3.279998e+11 |
| 2010-12-31 | 2.996343e+10 | 3.453526e+11 |
| 1996-09-30 | -3.724740e+09 | 2.025657e+11 |
| 2005-12-31 | 3.221023e+10 | 2.941988e+11 |
| 2010-06-30 | -1.164318e+09 | 3.702773e+11 |
| 2002-06-30 | -4.464085e+10 | 3.125389e+11 |
| 2007-03-31 | 1.410868e+10 | 3.344493e+11 |
| 2001-03-31 | -4.074735e+10 | 2.942323e+11 |
| 2012-03-31 | -1.908931e+10 | 4.085973e+11 |
| 2001-12-31 | -2.628759e+10 | 2.875486e+11 |
| 1997-12-31 | -1.370808e+10 | 2.298121e+11 |
| 1995-09-30 | 1.754559e+10 | 1.692104e+11 |
| 2000-03-31 | -2.642672e+10 | 2.712467e+11 |
| 2004-12-31 | 1.953533e+10 | 2.895807e+11 |
| 2009-03-31 | 6.499944e+09 | 3.447231e+11 |
| 1994-03-31 | -8.822834e+08 | 1.724593e+11 |
| 2011-12-31 | -5.560310e+09 | 3.928273e+11 |
| 2013-06-30 | -3.023361e+09 | 4.028574e+11 |
| 2003-06-30 | -2.587509e+10 | 3.104831e+11 |
| 1994-12-31 | 1.416535e+10 | 1.657757e+11 |
| 2006-12-31 | 3.425784e+10 | 3.102192e+11 |
The diagnostics below serve three purposes:
- To test whether the feature and target variables have the attributes required for Simple Linear Regression
- To check for overfitting or underfitting issues
- To check the normality and variance of the residual distribution
### Harvey-Collier Test
import statsmodels.stats.api as sms
sms.linear_harvey_collier(resultsLV)
Ttest_1sampResult(statistic=0.26035282277318245, pvalue=0.7957052734823411)
### Observed vs Predicted values
plt.figure(figsize=(10,5))
X_plot_GDP = [df_model_LV['GDPONS:Value£'].min(), df_model_LV['GDPONS:Value£'].max()]
ax = sns.scatterplot(x="GDPONS:Value£", y="GDP:PredictedLV", data=df_model_LV)
ax.set(xlabel='GDP Actual', ylabel='GDP Predicted')
plt.plot(X_plot_GDP, X_plot_GDP, color='r')
plt.show()
### Histogram of residuals
plt.figure(figsize=(10,5))
sns.histplot(res_modelLV, kde=True)
plt.title("OLS histogram distribution of Residuals with Bimodality")
plt.show()
### Residuals vs Fitted
plt.figure(figsize=(10,5))
sns.scatterplot(x='GDP:PredictedLV',y='ResidualLV', data = df_model_LV)
plt.show()
### Heteroscedasticity Check for Avg Low Value Payment Amount
#sm.graphics.plot_fit(resultsHV, 'AvgHighValTxn')
fig = plt.figure(figsize=(13,9))
sm.graphics.plot_regress_exog(resultsLV, 'AvgLowValTxn', fig=fig)
#sm.graphics.plot_regress_exog(resultsLV, 'GDPONS:Value£', fig=fig)
plt.show()
plt.rc("figure", figsize=(16,8))
# influence_plot creates its own figure, so no separate plt.figure() call is needed
sm.graphics.influence_plot(resultsLV, size=4)
plt.show()
An alternative approach to OLS, used to review the robustness of the time-series modelling by addressing autocorrelation between the predictor and target variables
The model is run with a summary and residual plot so the results can be compared against the corresponding OLS Simple Regression results
from sklearn.model_selection import train_test_split
import statsmodels.api as sm
import scipy.stats as stats
#X_train, X_test, y_train, y_test = train_test_split(feature_test['AvgLowValTxn'], feature_test['GDPONS:Value£'], test_size=0.30, random_state=10)
X_LV = feature_test['AvgLowValTxn'].values
y_LV = feature_test['GDPONS:Value£'].values
glsar_model_LV = sm.GLSAR(y_LV, X_LV, 1)
glsar_results_LV = glsar_model_LV.iterative_fit(1)
print(glsar_results_LV.summary())
res_modelLV_gls = glsar_results_LV.resid
fig_model2_LV =sm.qqplot(res_modelLV_gls, stats.t, fit = True, line ="45")
plt.title("GLS QQ Plot of Residuals")
plt.show()
sns.histplot(res_modelLV_gls, kde=True)
plt.title("GLS Residuals Histogram for Low Value Payments")
plt.show()
GLSAR Regression Results
=======================================================================================
Dep. Variable: y R-squared (uncentered): 0.986
Model: GLSAR Adj. R-squared (uncentered): 0.986
Method: Least Squares F-statistic: 5374.
Date: Sun, 20 Dec 2020 Prob (F-statistic): 2.83e-72
Time: 19:01:10 Log-Likelihood: -1979.8
No. Observations: 77 AIC: 3962.
Df Residuals: 76 BIC: 3964.
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
x1 4.798e+08 6.55e+06 73.305 0.000 4.67e+08 4.93e+08
==============================================================================
Omnibus: 1639.723 Durbin-Watson: 0.045
Prob(Omnibus): 0.000 Jarque-Bera (JB): 9.899
Skew: 0.086 Prob(JB): 0.00709
Kurtosis: 1.252 Cond. No. 1.00
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
### Simple Regression Model using one Predictor Variable - Average Credit Card Transaction
from sklearn.model_selection import train_test_split
import statsmodels.api as sm
import scipy.stats as stats
X_train, X_test, y_train, y_test = train_test_split(feature_test['AvgCreditCardTxn'], feature_test['GDPONS:Value£'], test_size=0.30, random_state=42)
#print (y_test.values)
X = sm.add_constant(X_train)
modelCC = sm.OLS(y_train,X)
resultsCC = modelCC.fit()
X1 = sm.add_constant(X_test)
y_pred = resultsCC.predict(X1)
y1_pred = y_pred.values
y1_test = y_test.values
y_test = y_test.to_frame()
y_pred = y_pred.to_frame()
#y_testCC = y_testCC.to_frame()
#y_predCC = y_predCC.to_frame()
#print (X1.shape)
#print (y_test.shape)
#print (y_pred.shape)
df_model_CC_temp1 = pd.merge(X1, y_test, left_index=True, right_index=True)
df_model_CC_temp2 = pd.merge(X1, y_pred, left_index=True, right_index=True)
df_model_CC_temp2=df_model_CC_temp2.rename(columns={0:"GDP:PredictedCC"})
df_model_CC = pd.merge(df_model_CC_temp1, df_model_CC_temp2, left_index=True, right_index=True)
df_model_CC['ResidualCC'] = df_model_CC['GDPONS:Value£']-df_model_CC['GDP:PredictedCC']
#sns.scatterplot(x='AvgCreditCardTxn_x',y='GDPONS:Value£', data = df_model_CC)
#plt.plot(df_model_CC['AvgCreditCardTxn_x'],df_model_CC['GDP:PredictedCC'])
#plt.show()
#sns.scatterplot(x='GDPONS:Value£',y='ResidualCC', data = df_model_CC)
res_modelCC = resultsCC.resid
fig_model2_CC =sm.qqplot(res_modelCC, stats.t, fit = True, line ="45")
plt.title("Residual Plot for Credit Card Payments as Predictor in OLS for GDP")
plt.show()
print(resultsCC.summary())
OLS Regression Results
==============================================================================
Dep. Variable: GDPONS:Value£ R-squared: 0.575
Model: OLS Adj. R-squared: 0.566
Method: Least Squares F-statistic: 70.25
Date: Sun, 20 Dec 2020 Prob (F-statistic): 3.19e-11
Time: 19:01:11 Log-Likelihood: -1402.4
No. Observations: 54 AIC: 2809.
Df Residuals: 52 BIC: 2813.
Df Model: 1
Covariance Type: nonrobust
====================================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------------
const -4.123e+10 4.04e+10 -1.020 0.312 -1.22e+11 3.99e+10
AvgCreditCardTxn 5.049e+09 6.02e+08 8.381 0.000 3.84e+09 6.26e+09
==============================================================================
Omnibus: 4.411 Durbin-Watson: 2.023
Prob(Omnibus): 0.110 Jarque-Bera (JB): 4.287
Skew: 0.674 Prob(JB): 0.117
Kurtosis: 2.702 Cond. No. 426.
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
### Heteroscedasticity Check for Avg Credit Card Payment Amount
fig = plt.figure(figsize=(13,9))
sm.graphics.plot_regress_exog(resultsCC, 'AvgCreditCardTxn', fig=fig)
plt.show()
sm.graphics.influence_plot(resultsCC, size=6)
plt.show()
The GLS model is run with a summary and residual plot to compare against the corresponding OLS Simple Regression results
from sklearn.model_selection import train_test_split
import statsmodels.api as sm
import scipy.stats as stats
#X_train, X_test, y_train, y_test = train_test_split(feature_test['AvgLowValTxn'], feature_test['GDPONS:Value£'], test_size=0.30, random_state=10)
X_CC = feature_test['AvgCreditCardTxn'].values
y_CC = feature_test['GDPONS:Value£'].values
glsar_model_CC = sm.GLSAR(y_CC, X_CC, 1)
glsar_results_CC = glsar_model_CC.iterative_fit(1)
print(glsar_results_CC.summary())
res_modelCC_gls = glsar_results_CC.resid
fig_model2_CC =sm.qqplot(res_modelCC_gls, stats.t, fit = True, line ="45")
plt.title("Residual Plot for Credit Card Payments as Predictor in GLS for GDP")
plt.show()
GLSAR Regression Results
=======================================================================================
Dep. Variable: y R-squared (uncentered): 0.977
Model: GLSAR Adj. R-squared (uncentered): 0.977
Method: Least Squares F-statistic: 3278.
Date: Sun, 20 Dec 2020 Prob (F-statistic): 2.91e-64
Time: 19:01:13 Log-Likelihood: -1998.5
No. Observations: 77 AIC: 3999.
Df Residuals: 76 BIC: 4001.
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
x1 4.448e+09 7.77e+07 57.254 0.000 4.29e+09 4.6e+09
==============================================================================
Omnibus: 9.086 Durbin-Watson: 0.126
Prob(Omnibus): 0.011 Jarque-Bera (JB): 9.855
Skew: 0.855 Prob(JB): 0.00724
Kurtosis: 2.617 Cond. No. 1.00
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
### Simple Regression Model using one Predictor Variable - High Value Transaction Amount
from sklearn.model_selection import train_test_split
import statsmodels.api as sm
import scipy.stats as stats
X_train, X_test, y_train, y_test = train_test_split(feature_test['AvgHighValTxn'], feature_test['GDPONS:Value£'], test_size=0.30, random_state=42)
#print (y_test.values)
X = sm.add_constant(X_train)
modelHV = sm.OLS(y_train,X)
resultsHV = modelHV.fit()
X1 = sm.add_constant(X_test)
y_pred = resultsHV.predict(X1)
y1_pred = y_pred.values
y1_test = y_test.values
y_test = y_test.to_frame()
y_pred = y_pred.to_frame()
#print (X1.shape)
#print (y_test.shape)
#print (y_pred.shape)
df_model_HVP_temp1 = pd.merge(X1, y_test, left_index=True, right_index=True)
df_model_HVP_temp2 = pd.merge(X1, y_pred, left_index=True, right_index=True)
df_model_HVP_temp2=df_model_HVP_temp2.rename(columns={0:"GDP:Predicted"})
df_model_HVP = pd.merge(df_model_HVP_temp1, df_model_HVP_temp2, left_index=True, right_index=True)
df_model_HVP['Residual'] = df_model_HVP['GDPONS:Value£']-df_model_HVP['GDP:Predicted']
#sns.scatterplot(x='AvgHighValTxn_x',y='GDPONS:Value£', data = df_model_HVP)
#plt.plot(df_model_HVP['AvgHighValTxn_x'],df_model_HVP['GDP:Predicted'])
#plt.show()
#sns.scatterplot(x='GDPONS:Value£',y='Residual', data = df_model_HVP)
res_model = resultsHV.resid
fig_model2 =sm.qqplot(res_model, stats.t, fit = True, line ="45")
plt.title("Residual Plot for High Value Payments as Predictor in OLS for GDP")
plt.show()
print(resultsHV.summary())
OLS Regression Results
==============================================================================
Dep. Variable: GDPONS:Value£ R-squared: 0.174
Model: OLS Adj. R-squared: 0.159
Method: Least Squares F-statistic: 10.98
Date: Sun, 20 Dec 2020 Prob (F-statistic): 0.00168
Time: 19:01:13 Log-Likelihood: -1420.3
No. Observations: 54 AIC: 2845.
Df Residuals: 52 BIC: 2849.
Df Model: 1
Covariance Type: nonrobust
=================================================================================
coef std err t P>|t| [0.025 0.975]
---------------------------------------------------------------------------------
const 6.226e+11 9.97e+10 6.242 0.000 4.22e+11 8.23e+11
AvgHighValTxn -1.609e+05 4.86e+04 -3.314 0.002 -2.58e+05 -6.35e+04
==============================================================================
Omnibus: 1.750 Durbin-Watson: 2.060
Prob(Omnibus): 0.417 Jarque-Bera (JB): 1.247
Skew: 0.104 Prob(JB): 0.536
Kurtosis: 2.285 Cond. No. 2.31e+07
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.31e+07. This might indicate that there are
strong multicollinearity or other numerical problems.
### Heteroscedasticity Check for Avg High Value Payment Amount
#sm.graphics.plot_fit(resultsHV, 'AvgHighValTxn')
fig = plt.figure(figsize=(13,9))
sm.graphics.plot_regress_exog(resultsHV, 'AvgHighValTxn', fig=fig)
plt.show()
sm.graphics.influence_plot(resultsHV, size=4)
plt.show()
The GLS model is run with a summary and residual plot to compare against the corresponding OLS Simple Regression results
from sklearn.model_selection import train_test_split
import statsmodels.api as sm
import scipy.stats as stats
#X_train, X_test, y_train, y_test = train_test_split(feature_test['AvgLowValTxn'], feature_test['GDPONS:Value£'], test_size=0.30, random_state=10)
X_HV = feature_test['AvgHighValTxn'].values
GDP_HV = feature_test['GDPONS:Value£'].values
glsar_model_HV = sm.GLSAR(GDP_HV, X_HV, 1)
glsar_results_HV = glsar_model_HV.iterative_fit(1)
print(glsar_results_HV.summary())
res_modelHV_gls = glsar_results_HV.resid
fig_model2_HV =sm.qqplot(res_modelHV_gls, stats.t, fit = True, line ="45")
plt.title("Residual Plot for High Value Payments as Predictor in GLS for GDP")
plt.show()
sns.displot(res_modelHV_gls, kde=True)
GLSAR Regression Results
=======================================================================================
Dep. Variable: y R-squared (uncentered): 0.920
Model: GLSAR Adj. R-squared (uncentered): 0.919
Method: Least Squares F-statistic: 879.7
Date: Sun, 20 Dec 2020 Prob (F-statistic): 1.57e-43
Time: 19:01:15 Log-Likelihood: -2046.9
No. Observations: 77 AIC: 4096.
Df Residuals: 76 BIC: 4098.
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
x1 1.407e+05 4744.727 29.659 0.000 1.31e+05 1.5e+05
==============================================================================
Omnibus: 198.045 Durbin-Watson: 0.031
Prob(Omnibus): 0.000 Jarque-Bera (JB): 8.251
Skew: -0.160 Prob(JB): 0.0162
Kurtosis: 1.428 Cond. No. 1.00
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
#print (df_model_LV)
print (feature_test['AvgLowValTxn'])
TimePeriod
1994-03-31 470.552851
1994-06-30 443.241576
1994-09-30 453.197340
1994-12-31 462.413078
1995-03-31 464.253851
...
2012-06-30 744.301300
2012-09-30 730.184292
2012-12-31 713.020438
2013-03-31 746.143117
2013-06-30 751.147093
Name: AvgLowValTxn, Length: 78, dtype: float64
Evaluation of the OLS Simple Regression approach with different combinations of feature variables in a Multiple Linear Regression Model
Modelling based on all three feature/predictor variables, with assessment of the summary and residuals for comparison of robustness and residual distribution against the OLS Simple Linear Regression outputs
### Multiple Linear Regression
Target = "GDPONS:Value£"
feature_test = feature_test.dropna(axis=0)
X = feature_test[['AvgLowValTxn','AvgCreditCardTxn', 'AvgHighValTxn']]
y = feature_test['GDPONS:Value£']
#print (X.isnull)
X = sm.add_constant(X)
model_multi = sm.OLS(y, X)
results_multi = model_multi.fit()
results_multi.params
print (results_multi.summary())
OLS Regression Results
==============================================================================
Dep. Variable: GDPONS:Value£ R-squared: 0.935
Model: OLS Adj. R-squared: 0.932
Method: Least Squares F-statistic: 354.8
Date: Sun, 20 Dec 2020 Prob (F-statistic): 8.08e-44
Time: 19:01:16 Log-Likelihood: -1953.3
No. Observations: 78 AIC: 3915.
Df Residuals: 74 BIC: 3924.
Df Model: 3
Covariance Type: nonrobust
====================================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------------
const -4.428e+10 4.06e+10 -1.092 0.279 -1.25e+11 3.65e+10
AvgLowValTxn 7.874e+08 4.15e+07 18.985 0.000 7.05e+08 8.7e+08
AvgCreditCardTxn -1.315e+08 3.69e+08 -0.357 0.722 -8.66e+08 6.03e+08
AvgHighValTxn -6.875e+04 1.46e+04 -4.718 0.000 -9.78e+04 -3.97e+04
==============================================================================
Omnibus: 3.179 Durbin-Watson: 0.344
Prob(Omnibus): 0.204 Jarque-Bera (JB): 2.598
Skew: -0.325 Prob(JB): 0.273
Kurtosis: 2.386 Cond. No. 3.94e+07
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 3.94e+07. This might indicate that there are
strong multicollinearity or other numerical problems.
### GLS Regression
X_GLS_3 = feature_test[['AvgLowValTxn','AvgCreditCardTxn', 'AvgHighValTxn']]
y_GLS_3 = feature_test['GDPONS:Value£']
#print (X.isnull)
#X = sm.add_constant(X)
glsar_model_MLR = sm.GLSAR(y_GLS_3, X_GLS_3, 1)
glsar_results_MLR = glsar_model_MLR.iterative_fit(1)
print(glsar_results_MLR.summary())
GLSAR Regression Results
=======================================================================================
Dep. Variable: GDPONS:Value£ R-squared (uncentered): 0.996
Model: GLSAR Adj. R-squared (uncentered): 0.996
Method: Least Squares F-statistic: 6550.
Date: Sun, 20 Dec 2020 Prob (F-statistic): 1.22e-89
Time: 19:01:16 Log-Likelihood: -1929.3
No. Observations: 77 AIC: 3865.
Df Residuals: 74 BIC: 3872.
Df Model: 3
Covariance Type: nonrobust
====================================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------------
AvgLowValTxn 7.867e+08 4.18e+07 18.817 0.000 7.03e+08 8.7e+08
AvgCreditCardTxn -3.225e+08 3.25e+08 -0.993 0.324 -9.69e+08 3.24e+08
AvgHighValTxn -8.404e+04 5934.772 -14.161 0.000 -9.59e+04 -7.22e+04
==============================================================================
Omnibus: 2.445 Durbin-Watson: 0.348
Prob(Omnibus): 0.294 Jarque-Bera (JB): 1.874
Skew: -0.217 Prob(JB): 0.392
Kurtosis: 2.371 Cond. No. 3.13e+05
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 3.13e+05. This might indicate that there are
strong multicollinearity or other numerical problems.
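The Durbin-Watson values near 0.35 in both summaries above indicate strong positive autocorrelation in the residuals (a value near 2 indicates none), which is what the GLSAR lag term is meant to absorb. A short sketch of the statistic itself, assuming `statsmodels.stats.stattools.durbin_watson`:

```python
import numpy as np
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(3)
white = rng.normal(size=500)     # uncorrelated residuals
trending = np.cumsum(white)      # highly autocorrelated residuals (random walk)

print(durbin_watson(white))      # near 2: no autocorrelation
print(durbin_watson(trending))   # near 0: strong positive autocorrelation
```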
### Residual plot for Multiple Linear Regression with all three feature variables
fig = plt.figure(figsize=(13,9))
sm.graphics.plot_regress_exog(results_multi, 'AvgLowValTxn', fig=fig)
plt.show()
fig = plt.figure(figsize=(13,9))
sm.graphics.plot_regress_exog(results_multi, 'AvgHighValTxn', fig=fig)
plt.show()
fig = plt.figure(figsize=(13,9))
sm.graphics.plot_regress_exog(results_multi, 'AvgCreditCardTxn', fig=fig)
plt.show()
sm.graphics.influence_plot(results_multi, size=4)
plt.show()
Modelling based on two feature/predictor variables, with assessment of the summary and residuals for comparison of robustness and residual distribution against the OLS Simple Linear Regression outputs
### Multiple Linear Regression - Low and High Value Payments
Target = "GDPONS:Value£"
feature_test = feature_test.dropna(axis=0)
X_2 = feature_test[['AvgLowValTxn','AvgHighValTxn']]
y_2 = feature_test['GDPONS:Value£']
#print (X.isnull)
#X_2 = sm.add_constant(X_2)
model_multi_2 = sm.OLS(y_2, X_2)
results_multi_2 = model_multi_2.fit()
results_multi_2.params
print (results_multi_2.summary())
OLS Regression Results
=======================================================================================
Dep. Variable: GDPONS:Value£ R-squared (uncentered): 0.996
Model: OLS Adj. R-squared (uncentered): 0.996
Method: Least Squares F-statistic: 9964.
Date: Sun, 20 Dec 2020 Prob (F-statistic): 1.07e-92
Time: 19:01:19 Log-Likelihood: -1954.5
No. Observations: 78 AIC: 3913.
Df Residuals: 76 BIC: 3918.
Df Model: 2
Covariance Type: nonrobust
=================================================================================
coef std err t P>|t| [0.025 0.975]
---------------------------------------------------------------------------------
AvgLowValTxn 7.478e+08 1.89e+07 39.660 0.000 7.1e+08 7.85e+08
AvgHighValTxn -8.26e+04 5698.359 -14.496 0.000 -9.4e+04 -7.13e+04
==============================================================================
Omnibus: 1.773 Durbin-Watson: 0.322
Prob(Omnibus): 0.412 Jarque-Bera (JB): 1.674
Skew: -0.259 Prob(JB): 0.433
Kurtosis: 2.502 Cond. No. 1.83e+04
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.83e+04. This might indicate that there are
strong multicollinearity or other numerical problems.
### GLS Regression Two Variables - LV and HV
X_GLS_2 = feature_test[['AvgLowValTxn', 'AvgHighValTxn']]
y_GLS_2 = feature_test['GDPONS:Value£']
#print (X.isnull)
#X = sm.add_constant(X)
glsar_model_MLR_2 = sm.GLSAR(y_GLS_2, X_GLS_2, 1)
glsar_results_MLR_2 = glsar_model_MLR_2.iterative_fit(1)
print(glsar_results_MLR_2.summary())
GLSAR Regression Results
=======================================================================================
Dep. Variable: GDPONS:Value£ R-squared (uncentered): 0.996
Model: GLSAR Adj. R-squared (uncentered): 0.996
Method: Least Squares F-statistic: 9827.
Date: Sun, 20 Dec 2020 Prob (F-statistic): 1.77e-91
Time: 19:01:19 Log-Likelihood: -1929.8
No. Observations: 77 AIC: 3864.
Df Residuals: 75 BIC: 3868.
Df Model: 2
Covariance Type: nonrobust
=================================================================================
coef std err t P>|t| [0.025 0.975]
---------------------------------------------------------------------------------
AvgLowValTxn 7.499e+08 1.94e+07 38.652 0.000 7.11e+08 7.89e+08
AvgHighValTxn -8.332e+04 5889.889 -14.147 0.000 -9.51e+04 -7.16e+04
==============================================================================
Omnibus: 1.735 Durbin-Watson: 0.323
Prob(Omnibus): 0.420 Jarque-Bera (JB): 1.617
Skew: -0.246 Prob(JB): 0.446
Kurtosis: 2.488 Cond. No. 1.86e+04
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.86e+04. This might indicate that there are
strong multicollinearity or other numerical problems.
from sklearn.model_selection import train_test_split
import statsmodels.api as sm
import scipy.stats as stats
#X_train, X_test, y_train, y_test = train_test_split(feature_test['AvgLowValTxn'], feature_test['GDPONS:Value£'], test_size=0.30, random_state=10)
X_2 = feature_test[['AvgLowValTxn','AvgHighValTxn']]
y_2 = feature_test['GDPONS:Value£']
# use the two-variable X_2/y_2 here, not the earlier single-variable X_LV/y_LV
glsar_model_LV = sm.GLSAR(y_2, X_2, 1)
glsar_results_LV = glsar_model_LV.iterative_fit(1)
print(glsar_results_LV.summary())
res_modelLV_gls = glsar_results_LV.resid
fig_model2_LV =sm.qqplot(res_modelLV_gls, stats.t, fit = True, line ="45")
plt.title("Residual Plot for Low and High Value Payments as Predictors in GLS for GDP")
plt.show()
### Residual plot for Multiple Linear Regression with two feature variables
fig = plt.figure(figsize=(13,9))
sm.graphics.plot_regress_exog(results_multi_2, 'AvgLowValTxn', fig=fig)
plt.show()
fig = plt.figure(figsize=(13,9))
sm.graphics.plot_regress_exog(results_multi_2, 'AvgHighValTxn', fig=fig)
plt.show()
sm.graphics.influence_plot(results_multi_2, size=4)
plt.show()
Modelling based on two feature/predictor variables, with assessment of the summary and residuals for comparison of robustness and residual distribution against the OLS Simple Linear Regression outputs
### Multiple Linear Regression - Credit Card and High Value Payments
Target = "GDPONS:Value£"
feature_test = feature_test.dropna(axis=0)
X_3 = feature_test[['AvgCreditCardTxn','AvgHighValTxn']]
y_3 = feature_test['GDPONS:Value£']
#print (X.isnull)
#X_2 = sm.add_constant(X_2)
model_multi_3 = sm.OLS(y_3, X_3)
results_multi_3 = model_multi_3.fit()
results_multi_3.params
print (results_multi_3.summary())
OLS Regression Results
=======================================================================================
Dep. Variable: GDPONS:Value£ R-squared (uncentered): 0.978
Model: OLS Adj. R-squared (uncentered): 0.978
Method: Least Squares F-statistic: 1717.
Date: Sun, 20 Dec 2020 Prob (F-statistic): 5.66e-64
Time: 19:01:22 Log-Likelihood: -2022.4
No. Observations: 78 AIC: 4049.
Df Residuals: 76 BIC: 4053.
Df Model: 2
Covariance Type: nonrobust
====================================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------------
AvgCreditCardTxn 5.118e+09 3.5e+08 14.604 0.000 4.42e+09 5.82e+09
AvgHighValTxn -2.241e+04 1.14e+04 -1.971 0.052 -4.51e+04 229.640
==============================================================================
Omnibus: 6.912 Durbin-Watson: 0.156
Prob(Omnibus): 0.032 Jarque-Bera (JB): 6.949
Skew: 0.731 Prob(JB): 0.0310
Kurtosis: 2.966 Cond. No. 1.42e+05
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.42e+05. This might indicate that there are
strong multicollinearity or other numerical problems.
### Residual plots for model assessment
fig = plt.figure(figsize=(13,9))
sm.graphics.plot_regress_exog(results_multi_3, 'AvgCreditCardTxn', fig=fig)
plt.show()
fig = plt.figure(figsize=(13,9))
sm.graphics.plot_regress_exog(results_multi_3, 'AvgHighValTxn', fig=fig)
plt.show()
sm.graphics.influence_plot(results_multi_3, size=4)
plt.show()
### GLS Regression for High Value and Credit Card Payments
X_GLS_22 = feature_test[['AvgCreditCardTxn', 'AvgHighValTxn']]
y_GLS_22 = feature_test['GDPONS:Value£']
#print (X.isnull)
#X = sm.add_constant(X)
glsar_model_MLR_22 = sm.GLSAR(y_GLS_22, X_GLS_22, 1)
glsar_results_MLR_22 = glsar_model_MLR_22.iterative_fit(1)
print(glsar_results_MLR_22.summary())
X_GLS_23 = feature_test[['AvgCreditCardTxn', 'AvgLowValTxn']]
y_GLS_23 = feature_test['GDPONS:Value£']
#print (X.isnull)
#X = sm.add_constant(X)
glsar_model_MLR_23 = sm.GLSAR(y_GLS_23, X_GLS_23, 1)
glsar_results_MLR_23 = glsar_model_MLR_23.iterative_fit(1)
print(glsar_results_MLR_23.summary())
GLSAR Regression Results
=======================================================================================
Dep. Variable: GDPONS:Value£ R-squared (uncentered): 0.978
Model: GLSAR Adj. R-squared (uncentered): 0.978
Method: Least Squares F-statistic: 1691.
Date: Sun, 20 Dec 2020 Prob (F-statistic): 4.15e-63
Time: 19:01:25 Log-Likelihood: -1996.9
No. Observations: 77 AIC: 3998.
Df Residuals: 75 BIC: 4002.
Df Model: 2
Covariance Type: nonrobust
====================================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------------
AvgCreditCardTxn 5.088e+09 3.6e+08 14.137 0.000 4.37e+09 5.81e+09
AvgHighValTxn -2.135e+04 1.17e+04 -1.819 0.073 -4.47e+04 2026.853
==============================================================================
Omnibus: 6.700 Durbin-Watson: 0.155
Prob(Omnibus): 0.035 Jarque-Bera (JB): 6.772
Skew: 0.725 Prob(JB): 0.0338
Kurtosis: 2.915 Cond. No. 1.44e+05
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.44e+05. This might indicate that there are
strong multicollinearity or other numerical problems.
GLSAR Regression Results
=======================================================================================
Dep. Variable: GDPONS:Value£ R-squared (uncentered): 0.986
Model: GLSAR Adj. R-squared (uncentered): 0.986
Method: Least Squares F-statistic: 2657.
Date: Sun, 20 Dec 2020 Prob (F-statistic): 2.42e-70
Time: 19:01:25 Log-Likelihood: -1979.8
No. Observations: 77 AIC: 3964.
Df Residuals: 75 BIC: 3968.
Df Model: 2
Covariance Type: nonrobust
====================================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------------
AvgCreditCardTxn 2.386e+08 6.16e+08 0.387 0.700 -9.89e+08 1.47e+09
AvgLowValTxn 4.543e+08 6.62e+07 6.863 0.000 3.22e+08 5.86e+08
==============================================================================
Omnibus: 2001.636 Durbin-Watson: 0.046
Prob(Omnibus): 0.000 Jarque-Bera (JB): 9.788
Skew: 0.127 Prob(JB): 0.00749
Kurtosis: 1.272 Cond. No. 94.7
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
In addition to the model evaluation techniques of the prior section (study of linearity, coefficients and residuals from OLS Simple Linear Regression, compared with the Generalised Least Squares and Multiple Linear Regression model summaries), additional methods, namely Sklearn Linear Regression with Cross Fold Validation and Polynomial fit modelling, are used to assess the robustness and performance of the OLS Simple Linear Regression Model
Use of an alternate Linear Regression method to compare the coefficient from OLS Simple Linear Regression
Use of Cross Fold Validation to check overfitting/underfitting issues as well as the variance of the coefficients across random samples
import sklearn
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold
#X=feature_test[['AvgLowValTxn']].values
#y=feature_test['GDPONS:Value£'].values
## train-test 67-33 split
feature_train1, feature_test1 = train_test_split(feature_test,
train_size = 0.67,
test_size = 0.33,
random_state = 80)
#print (feature_test1)
#print (feature_train1)
#print ()
# divide into X_train, y_train, X_test, y_test
GDP_train_f = feature_train1['GDPONS:Value£']
LV_train_f = feature_train1[['AvgLowValTxn']]
GDP_test_f = feature_test1['GDPONS:Value£']
LV_test_f = feature_test1[['AvgLowValTxn']]
#sns.regplot(x="AvgLowValTxn", y="GDPONS:Value£", data=feature_test, fit_reg=False)
lm_model = LinearRegression()
lm_result = lm_model.fit(LV_train_f,GDP_train_f)
GDP_pred_f = lm_result.predict(LV_test_f)
#print (cross_val_score(LinearRegression(), X, y, scoring='r2', cv=5))
r2_lm = sklearn.metrics.r2_score(GDP_test_f, GDP_pred_f)
print (r2_lm)
0.9331425966170507
print(feature_test.shape)
(78, 4)
### Cross Fold Validation
# create a KFold object with 5 splits
GDP_folds = KFold(n_splits = 5, shuffle = True, random_state = 80)
scores = cross_val_score(lm_model, LV_train_f, GDP_train_f, scoring='r2', cv=GDP_folds)
print (scores)
print(len(LV_train_f))
[0.82798157 0.83643402 0.91394902 0.93447525 0.86199382]
52
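The fold-to-fold spread in R² above is one view of model stability; the coefficient-variance check described earlier can be made explicit by refitting the model on each training fold and collecting the slope estimates. A minimal sketch, using synthetic stand-in data (the values and variable shapes here are illustrative, not the report's actual payments data):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

# Synthetic stand-in for the payments data (illustrative values only)
rng = np.random.default_rng(80)
X = rng.uniform(500, 900, size=(78, 1))          # stand-in for AvgLowValTxn
y = 4.5e8 * X[:, 0] + rng.normal(0, 2e10, 78)    # stand-in for GDP with noise

# Refit on each training fold and collect the slope estimates
folds = KFold(n_splits=5, shuffle=True, random_state=80)
coefs = []
for train_idx, _ in folds.split(X):
    fit = LinearRegression().fit(X[train_idx], y[train_idx])
    coefs.append(fit.coef_[0])

print(np.mean(coefs), np.std(coefs))
```

A small standard deviation relative to the mean slope suggests the coefficient is stable across random samples, which is the point of the cross-validation check here.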
Polynomial fits are used to check whether a second- or higher-order fit addresses the residuals distribution issue identified in the OLS Simple Linear Regression modelling. However, the results do not indicate a non-linear relationship between the predictor and target variables.
### trying a polynomial fit to validate if the simple regression model fit is robust
from sklearn.preprocessing import PolynomialFeatures
from statsmodels.sandbox.regression.predstd import wls_prediction_std
poly_feature = PolynomialFeatures(degree=3)
feature_poly_test = feature_test.copy()
ft_GDP = feature_poly_test['GDPONS:Value£']
ft_LV = feature_poly_test[['AvgLowValTxn']]
poly_LV = poly_feature.fit_transform(ft_LV)
model_poly_GDP = sm.OLS(ft_GDP, poly_LV).fit()
pred_poly_GDP = model_poly_GDP.predict(poly_LV)
plt.scatter(ft_LV, ft_GDP)
plt.plot(ft_LV, pred_poly_GDP)
plt.title("3rd order Polynomial fit for Low Value Payments as Predictor and GDP as Target variable")
plt.xlabel('Avg Low Value Payment')
plt.ylabel('GDP, £mm')
plt.show()
model_poly_GDP.summary()
| Dep. Variable: | GDPONS:Value£ | R-squared: | 0.934 |
|---|---|---|---|
| Model: | OLS | Adj. R-squared: | 0.931 |
| Method: | Least Squares | F-statistic: | 347.8 |
| Date: | Sun, 20 Dec 2020 | Prob (F-statistic): | 1.60e-43 |
| Time: | 19:01:25 | Log-Likelihood: | -1954.1 |
| No. Observations: | 78 | AIC: | 3916. |
| Df Residuals: | 74 | BIC: | 3926. |
| Df Model: | 3 | | |
| Covariance Type: | nonrobust | | |
| coef | std err | t | P>|t| | [0.025 | 0.975] | |
|---|---|---|---|---|---|---|
| const | 3.524e+12 | 7.78e+11 | 4.529 | 0.000 | 1.97e+12 | 5.07e+12 |
| x1 | -1.793e+10 | 3.96e+09 | -4.529 | 0.000 | -2.58e+10 | -1e+10 |
| x2 | 3.085e+07 | 6.62e+06 | 4.657 | 0.000 | 1.77e+07 | 4.4e+07 |
| x3 | -1.668e+04 | 3649.791 | -4.570 | 0.000 | -2.4e+04 | -9407.635 |
| Omnibus: | 3.026 | Durbin-Watson: | 0.447 |
|---|---|---|---|
| Prob(Omnibus): | 0.220 | Jarque-Bera (JB): | 2.275 |
| Skew: | -0.372 | Prob(JB): | 0.321 |
| Kurtosis: | 3.382 | Cond. No. | 9.55e+10 |
### trying a polynomial fit to validate if the simple regression model fit is robust
from sklearn.preprocessing import PolynomialFeatures
poly_feature = PolynomialFeatures(degree=5)
feature_poly_test = feature_test.copy()
ft_GDP = feature_poly_test['GDPONS:Value£']
ft_LV = feature_poly_test[['AvgLowValTxn']]
poly_LV = poly_feature.fit_transform(ft_LV)
model_poly_GDP = sm.OLS(ft_GDP, poly_LV).fit()
pred_poly_GDP = model_poly_GDP.predict(poly_LV)
plt.scatter(ft_LV, ft_GDP)
plt.plot(ft_LV, pred_poly_GDP)
plt.title("5th order Polynomial fit for Low Value Payments as Predictor and GDP as Target variable")
plt.xlabel('Avg Low Value Payment')
plt.ylabel('GDP, £mm')
plt.show()
model_poly_GDP.summary()
| Dep. Variable: | GDPONS:Value£ | R-squared: | 0.937 |
|---|---|---|---|
| Model: | OLS | Adj. R-squared: | 0.933 |
| Method: | Least Squares | F-statistic: | 271.2 |
| Date: | Sun, 20 Dec 2020 | Prob (F-statistic): | 5.44e-43 |
| Time: | 19:01:25 | Log-Likelihood: | -1952.2 |
| No. Observations: | 78 | AIC: | 3914. |
| Df Residuals: | 73 | BIC: | 3926. |
| Df Model: | 4 | | |
| Covariance Type: | nonrobust | | |
| coef | std err | t | P>|t| | [0.025 | 0.975] | |
|---|---|---|---|---|---|---|
| const | -6.128e+07 | 8.95e+07 | -0.685 | 0.496 | -2.4e+08 | 1.17e+08 |
| x1 | -7.196e+09 | 1.05e+10 | -0.685 | 0.496 | -2.81e+10 | 1.37e+10 |
| x2 | 6.085e+07 | 7.1e+07 | 0.857 | 0.394 | -8.07e+07 | 2.02e+08 |
| x3 | -1.79e+05 | 1.78e+05 | -1.003 | 0.319 | -5.35e+05 | 1.77e+05 |
| x4 | 228.3815 | 197.437 | 1.157 | 0.251 | -165.110 | 621.873 |
| x5 | -0.1061 | 0.081 | -1.306 | 0.196 | -0.268 | 0.056 |
| Omnibus: | 2.164 | Durbin-Watson: | 0.453 |
|---|---|---|---|
| Prob(Omnibus): | 0.339 | Jarque-Bera (JB): | 1.640 |
| Skew: | -0.068 | Prob(JB): | 0.440 |
| Kurtosis: | 3.697 | Cond. No. | 2.53e+18 |
### trying a polynomial fit to validate if the simple regression model fit is robust
from sklearn.preprocessing import PolynomialFeatures
poly_feature = PolynomialFeatures(degree=8)
feature_poly_test = feature_test.copy()
ft_GDP = feature_poly_test['GDPONS:Value£']
ft_LV = feature_poly_test[['AvgLowValTxn']]
poly_LV = poly_feature.fit_transform(ft_LV)
model_poly_GDP = sm.OLS(ft_GDP, poly_LV).fit()
pred_poly_GDP = model_poly_GDP.predict(poly_LV)
plt.scatter(ft_LV, ft_GDP)
plt.plot(ft_LV, pred_poly_GDP)
plt.title("8th order Polynomial fit for Low Value Payments as Predictor and GDP as Target variable")
plt.xlabel('Avg Low Value Payment')
plt.ylabel('GDP, £mm')
plt.show()
model_poly_GDP.summary()
| Dep. Variable: | GDPONS:Value£ | R-squared: | 0.933 |
|---|---|---|---|
| Model: | OLS | Adj. R-squared: | 0.931 |
| Method: | Least Squares | F-statistic: | 346.1 |
| Date: | Sun, 20 Dec 2020 | Prob (F-statistic): | 1.89e-43 |
| Time: | 19:01:26 | Log-Likelihood: | -1954.2 |
| No. Observations: | 78 | AIC: | 3916. |
| Df Residuals: | 74 | BIC: | 3926. |
| Df Model: | 3 | | |
| Covariance Type: | nonrobust | | |
| coef | std err | t | P>|t| | [0.025 | 0.975] | |
|---|---|---|---|---|---|---|
| const | 9.05e-14 | 9.93e-15 | 9.113 | 0.000 | 7.07e-14 | 1.1e-13 |
| x1 | -1.503e-07 | 1.65e-08 | -9.113 | 0.000 | -1.83e-07 | -1.17e-07 |
| x2 | 1.152e-08 | 1.26e-09 | 9.113 | 0.000 | 9e-09 | 1.4e-08 |
| x3 | 3.394e-06 | 3.72e-07 | 9.113 | 0.000 | 2.65e-06 | 4.14e-06 |
| x4 | 0.0008 | 9e-05 | 9.113 | 0.000 | 0.001 | 0.001 |
| x5 | 0.1249 | 0.014 | 9.113 | 0.000 | 0.098 | 0.152 |
| x6 | -0.0005 | 6.51e-05 | -8.017 | 0.000 | -0.001 | -0.000 |
| x7 | 7.558e-07 | 1.02e-07 | 7.380 | 0.000 | 5.52e-07 | 9.6e-07 |
| x8 | -3.719e-10 | 5.33e-11 | -6.974 | 0.000 | -4.78e-10 | -2.66e-10 |
| Omnibus: | 1.209 | Durbin-Watson: | 0.482 |
|---|---|---|---|
| Prob(Omnibus): | 0.546 | Jarque-Bera (JB): | 0.686 |
| Skew: | 0.191 | Prob(JB): | 0.710 |
| Kurtosis: | 3.254 | Cond. No. | 3.32e+29 |
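The condition numbers in these summaries grow from 9.55e+10 at degree 3 to 3.32e+29 at degree 8, because raw powers of a large-valued predictor are almost perfectly collinear. Standardising the predictor before expanding it is a common remedy; a minimal sketch on synthetic data (illustrative values, not the report's actual series):

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic large-valued predictor, a stand-in for a transaction series (illustrative)
rng = np.random.default_rng(0)
x = rng.uniform(500, 900, size=(78, 1))

# Design matrices built from raw and standardised inputs
raw = PolynomialFeatures(degree=5).fit_transform(x)
scaled = PolynomialFeatures(degree=5).fit_transform(StandardScaler().fit_transform(x))

# Standardising shrinks the condition number by many orders of magnitude
print(np.linalg.cond(raw), np.linalg.cond(scaled))
```

The astronomical condition numbers do not invalidate the fitted values here, but they make the individual polynomial coefficients numerically unreliable, which is worth keeping in mind when reading the coefficient tables above.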
### Polynomial fit for Credit Cards Payment
from sklearn.preprocessing import PolynomialFeatures
from statsmodels.sandbox.regression.predstd import wls_prediction_std
poly_feature = PolynomialFeatures(degree=5)
feature_poly_testCC = feature_test.copy()
ft_GDP = feature_poly_testCC['GDPONS:Value£']
ft_CC = feature_poly_testCC[['AvgCreditCardTxn']]
poly_CC = poly_feature.fit_transform(ft_CC)
model_poly_GDP = sm.OLS(ft_GDP, poly_CC).fit()
pred_poly_GDP = model_poly_GDP.predict(poly_CC)
plt.scatter(ft_CC, ft_GDP)
plt.plot(ft_CC, pred_poly_GDP)
plt.title("5th order Polynomial fit for Credit Card Payments as Predictor and GDP as Target variable")
plt.xlabel('Avg Credit Card Payment')
plt.ylabel('GDP, £mm')
plt.show()
model_poly_GDP.summary()
| Dep. Variable: | GDPONS:Value£ | R-squared: | 0.751 |
|---|---|---|---|
| Model: | OLS | Adj. R-squared: | 0.733 |
| Method: | Least Squares | F-statistic: | 43.35 |
| Date: | Sun, 20 Dec 2020 | Prob (F-statistic): | 2.17e-20 |
| Time: | 19:01:26 | Log-Likelihood: | -2005.8 |
| No. Observations: | 78 | AIC: | 4024. |
| Df Residuals: | 72 | BIC: | 4038. |
| Df Model: | 5 | | |
| Covariance Type: | nonrobust | | |
| coef | std err | t | P>|t| | [0.025 | 0.975] | |
|---|---|---|---|---|---|---|
| const | -8.331e+13 | 4.87e+13 | -1.710 | 0.092 | -1.8e+14 | 1.38e+13 |
| x1 | 6.892e+12 | 3.91e+12 | 1.762 | 0.082 | -9.06e+11 | 1.47e+13 |
| x2 | -2.25e+11 | 1.24e+11 | -1.808 | 0.075 | -4.73e+11 | 2.31e+10 |
| x3 | 3.625e+09 | 1.96e+09 | 1.848 | 0.069 | -2.85e+08 | 7.53e+09 |
| x4 | -2.879e+07 | 1.53e+07 | -1.879 | 0.064 | -5.93e+07 | 1.75e+06 |
| x5 | 9.017e+04 | 4.75e+04 | 1.899 | 0.062 | -4475.302 | 1.85e+05 |
| Omnibus: | 2.403 | Durbin-Watson: | 0.391 |
|---|---|---|---|
| Prob(Omnibus): | 0.301 | Jarque-Bera (JB): | 1.687 |
| Skew: | -0.264 | Prob(JB): | 0.430 |
| Kurtosis: | 3.489 | Cond. No. | 2.11e+13 |
### Polynomial fit for High Value Payment
from sklearn.preprocessing import PolynomialFeatures
from statsmodels.sandbox.regression.predstd import wls_prediction_std
poly_feature = PolynomialFeatures(degree=5)
feature_poly_testHV = feature_test.copy()
ft_GDP = feature_poly_testHV['GDPONS:Value£']
ft_HV = feature_poly_testHV[['AvgHighValTxn']]
poly_HV = poly_feature.fit_transform(ft_HV)
model_poly_GDP = sm.OLS(ft_GDP, poly_HV).fit()
pred_poly_GDP = model_poly_GDP.predict(poly_HV)
plt.scatter(ft_HV, ft_GDP)
plt.plot(ft_HV, pred_poly_GDP)
plt.title("5th order Polynomial fit for High Value Payments as Predictor and GDP as Target variable")
plt.xlabel('Avg High Value Payment')
plt.ylabel('GDP, £mm')
plt.show()
model_poly_GDP.summary()
| Dep. Variable: | GDPONS:Value£ | R-squared: | 0.231 |
|---|---|---|---|
| Model: | OLS | Adj. R-squared: | 0.210 |
| Method: | Least Squares | F-statistic: | 11.26 |
| Date: | Sun, 20 Dec 2020 | Prob (F-statistic): | 5.27e-05 |
| Time: | 19:01:26 | Log-Likelihood: | -2049.7 |
| No. Observations: | 78 | AIC: | 4105. |
| Df Residuals: | 75 | BIC: | 4112. |
| Df Model: | 2 | | |
| Covariance Type: | nonrobust | | |
| coef | std err | t | P>|t| | [0.025 | 0.975] | |
|---|---|---|---|---|---|---|
| const | 5.525e-25 | 1.45e-25 | 3.806 | 0.000 | 2.63e-25 | 8.42e-25 |
| x1 | -3.51e-09 | 9.22e-10 | -3.806 | 0.000 | -5.35e-09 | -1.67e-09 |
| x2 | 6.846e-13 | 1.8e-13 | 3.806 | 0.000 | 3.26e-13 | 1.04e-12 |
| x3 | 4.655e-07 | 1.22e-07 | 3.806 | 0.000 | 2.22e-07 | 7.09e-07 |
| x4 | -3.454e-13 | 1.18e-13 | -2.939 | 0.004 | -5.8e-13 | -1.11e-13 |
| x5 | 6.585e-20 | 2.81e-20 | 2.343 | 0.022 | 9.87e-21 | 1.22e-19 |
| Omnibus: | 0.989 | Durbin-Watson: | 0.111 |
|---|---|---|---|
| Prob(Omnibus): | 0.610 | Jarque-Bera (JB): | 0.925 |
| Skew: | 0.039 | Prob(JB): | 0.630 |
| Kurtosis: | 2.472 | Cond. No. | 1.01e+37 |
Polynomial fits are also applied to multiple predictor variables (Low Value, High Value and Credit Card Payments) to check whether a second- or higher-order fit addresses the residuals distribution issue identified in the OLS Simple Linear Regression modelling. However, the results do not indicate a non-linear relationship between the predictor and target variables.
### Polynomial fit for multivariate
from sklearn.preprocessing import PolynomialFeatures
from statsmodels.sandbox.regression.predstd import wls_prediction_std
poly_feature = PolynomialFeatures(degree=5)
feature_poly_testMR = feature_test.copy()
ft_GDP_MR = feature_poly_testMR['GDPONS:Value£']
ft_MR = feature_poly_testMR[['AvgLowValTxn','AvgCreditCardTxn', 'AvgHighValTxn']]
poly_MR = poly_feature.fit_transform(ft_MR)
model_poly_GDP_MR = sm.OLS(ft_GDP_MR, poly_MR).fit()
pred_poly_GDP = model_poly_GDP_MR.predict(poly_MR)
#plt.scatter(ft_MR, ft_GDP)
plt.scatter(ft_GDP_MR, pred_poly_GDP)
plt.title("5th order Polynomial fit for MultiRegression with 3 Predictors and GDP as Target variable")
plt.xlabel('Actual GDP, £mm')
plt.ylabel('Predicted GDP, £mm')
plt.show()
model_poly_GDP_MR.summary()
| Dep. Variable: | GDPONS:Value£ | R-squared: | 0.949 |
|---|---|---|---|
| Model: | OLS | Adj. R-squared: | 0.942 |
| Method: | Least Squares | F-statistic: | 140.3 |
| Date: | Sun, 20 Dec 2020 | Prob (F-statistic): | 2.51e-40 |
| Time: | 19:01:27 | Log-Likelihood: | -1944.0 |
| No. Observations: | 78 | AIC: | 3908. |
| Df Residuals: | 68 | BIC: | 3931. |
| Df Model: | 9 | | |
| Covariance Type: | nonrobust | | |
| coef | std err | t | P>|t| | [0.025 | 0.975] | |
|---|---|---|---|---|---|---|
| const | 2.071e-11 | 4.8e-12 | 4.313 | 0.000 | 1.11e-11 | 3.03e-11 |
| x1 | -5.007e-10 | 1.18e-10 | -4.255 | 0.000 | -7.36e-10 | -2.66e-10 |
| x2 | 8.215e-13 | 1.92e-13 | 4.281 | 0.000 | 4.39e-13 | 1.2e-12 |
| x3 | -7.74e-14 | 1.81e-14 | -4.276 | 0.000 | -1.14e-13 | -4.13e-14 |
| x4 | 2.827e-15 | 6.61e-16 | 4.278 | 0.000 | 1.51e-15 | 4.15e-15 |
| x5 | -3.046e-16 | 7.1e-17 | -4.291 | 0.000 | -4.46e-16 | -1.63e-16 |
| x6 | 5.734e-15 | 1.34e-15 | 4.265 | 0.000 | 3.05e-15 | 8.42e-15 |
| x7 | -7.028e-14 | 1.62e-14 | -4.328 | 0.000 | -1.03e-13 | -3.79e-14 |
| x8 | 2.315e-14 | 5.5e-15 | 4.208 | 0.000 | 1.22e-14 | 3.41e-14 |
| x9 | 4.424e-13 | 1.06e-13 | 4.189 | 0.000 | 2.32e-13 | 6.53e-13 |
| x10 | 6.977e-13 | 1.62e-13 | 4.294 | 0.000 | 3.73e-13 | 1.02e-12 |
| x11 | -1.203e-11 | 2.81e-12 | -4.280 | 0.000 | -1.76e-11 | -6.42e-12 |
| x12 | 9.505e-16 | 2.21e-16 | 4.301 | 0.000 | 5.1e-16 | 1.39e-15 |
| x13 | 1.575e-20 | 3.65e-21 | 4.316 | 0.000 | 8.47e-21 | 2.3e-20 |
| x14 | 1.566e-16 | 3.62e-17 | 4.328 | 0.000 | 8.44e-17 | 2.29e-16 |
| x15 | 1.595e-12 | 3.7e-13 | 4.309 | 0.000 | 8.56e-13 | 2.33e-12 |
| x16 | 2.06e-21 | 4.76e-22 | 4.324 | 0.000 | 1.11e-21 | 3.01e-21 |
| x17 | 2.069e-17 | 4.76e-18 | 4.349 | 0.000 | 1.12e-17 | 3.02e-17 |
| x18 | 1.882e-13 | 4.29e-14 | 4.386 | 0.000 | 1.03e-13 | 2.74e-13 |
| x19 | 4.079e-09 | 9.46e-10 | 4.312 | 0.000 | 2.19e-09 | 5.97e-09 |
| x20 | 1.492e-16 | 3.48e-17 | 4.288 | 0.000 | 7.98e-17 | 2.19e-16 |
| x21 | 6.698e-17 | 1.56e-17 | 4.298 | 0.000 | 3.59e-17 | 9.81e-17 |
| x22 | 3.081e-13 | 7.18e-14 | 4.290 | 0.000 | 1.65e-13 | 4.51e-13 |
| x23 | 1.178e-17 | 2.74e-18 | 4.301 | 0.000 | 6.32e-18 | 1.73e-17 |
| x24 | 1.037e-13 | 2.41e-14 | 4.305 | 0.000 | 5.56e-14 | 1.52e-13 |
| x25 | 3.688e-10 | 8.6e-11 | 4.290 | 0.000 | 1.97e-10 | 5.4e-10 |
| x26 | 1.686e-18 | 3.92e-19 | 4.304 | 0.000 | 9.04e-19 | 2.47e-18 |
| x27 | 1.721e-14 | 3.99e-15 | 4.312 | 0.000 | 9.24e-15 | 2.52e-14 |
| x28 | 9.074e-11 | 2.09e-11 | 4.343 | 0.000 | 4.9e-11 | 1.32e-10 |
| x29 | 5.005e-10 | 1.95e-10 | 2.572 | 0.012 | 1.12e-10 | 8.89e-10 |
| x30 | 2.202e-19 | 5.11e-20 | 4.308 | 0.000 | 1.18e-19 | 3.22e-19 |
| x31 | 2.378e-15 | 5.51e-16 | 4.318 | 0.000 | 1.28e-15 | 3.48e-15 |
| x32 | 1.327e-11 | 3.04e-12 | 4.371 | 0.000 | 7.21e-12 | 1.93e-11 |
| x33 | -1.245e-09 | 1.56e-09 | -0.800 | 0.427 | -4.35e-09 | 1.86e-09 |
| x34 | -6.121e-14 | 1.8e-14 | -3.400 | 0.001 | -9.71e-14 | -2.53e-14 |
| x35 | -8.272e-14 | 1.93e-14 | -4.287 | 0.000 | -1.21e-13 | -4.42e-14 |
| x36 | 3.814e-14 | 8.89e-15 | 4.290 | 0.000 | 2.04e-14 | 5.59e-14 |
| x37 | -5.946e-11 | 1.39e-11 | -4.281 | 0.000 | -8.72e-11 | -3.17e-11 |
| x38 | 8.431e-15 | 1.96e-15 | 4.292 | 0.000 | 4.51e-15 | 1.24e-14 |
| x39 | 6.891e-11 | 1.61e-11 | 4.290 | 0.000 | 3.69e-11 | 1.01e-10 |
| x40 | -7.963e-09 | 1.94e-09 | -4.109 | 0.000 | -1.18e-08 | -4.1e-09 |
| x41 | 1.314e-15 | 3.06e-16 | 4.293 | 0.000 | 7.03e-16 | 1.92e-15 |
| x42 | 1.425e-11 | 3.32e-12 | 4.292 | 0.000 | 7.63e-12 | 2.09e-11 |
| x43 | 8.058e-08 | 1.88e-08 | 4.290 | 0.000 | 4.31e-08 | 1.18e-07 |
| x44 | 4.816e-12 | 1.22e-12 | 3.932 | 0.000 | 2.37e-12 | 7.26e-12 |
| x45 | 1.804e-16 | 4.2e-17 | 4.295 | 0.000 | 9.66e-17 | 2.64e-16 |
| x46 | 2.172e-12 | 5.06e-13 | 4.294 | 0.000 | 1.16e-12 | 3.18e-12 |
| x47 | 1.587e-08 | 3.7e-09 | 4.293 | 0.000 | 8.49e-09 | 2.32e-08 |
| x48 | -5.061e-11 | 1.26e-11 | -4.002 | 0.000 | -7.58e-11 | -2.54e-11 |
| x49 | -8.604e-16 | 2.49e-16 | -3.462 | 0.001 | -1.36e-15 | -3.64e-16 |
| x50 | 2.329e-17 | 5.42e-18 | 4.297 | 0.000 | 1.25e-17 | 3.41e-17 |
| x51 | 2.955e-13 | 6.88e-14 | 4.296 | 0.000 | 1.58e-13 | 4.33e-13 |
| x52 | 2.378e-09 | 5.54e-10 | 4.295 | 0.000 | 1.27e-09 | 3.48e-09 |
| x53 | 9.826e-12 | 1.01e-11 | 0.977 | 0.332 | -1.02e-11 | 2.99e-11 |
| x54 | 7.462e-15 | 2.1e-15 | 3.553 | 0.001 | 3.27e-15 | 1.17e-14 |
| x55 | 2.233e-20 | 1.56e-20 | 1.428 | 0.158 | -8.88e-21 | 5.35e-20 |
| Omnibus: | 0.294 | Durbin-Watson: | 1.067 |
|---|---|---|---|
| Prob(Omnibus): | 0.863 | Jarque-Bera (JB): | 0.300 |
| Skew: | -0.138 | Prob(JB): | 0.861 |
| Kurtosis: | 2.873 | Cond. No. | 2.11e+18 |
Data Sources
Bank of England, Real Gross Domestic Product at Market Prices in the United Kingdom [RGDPMPUKQ], retrieved from FRED, Federal Reserve Bank of St. Louis, December 9, 2020. https://fred.stlouisfed.org/series/RGDPMPUKQ
Samuel H. Williamson, "What Was the U.K. GDP Then?", MeasuringWorth, 2020. http://www.measuringworth.com/ukgdp/
Coding inspiration
Interpreting Skewness and Kurtosis: https://medium.com/@atanudan/kurtosis-skew-function-in-pandas-aa63d72e20de
Peter H. Westfall (2014), "Kurtosis as Peakedness, 1905–2014. R.I.P.", The American Statistician, 68:3, 191-195. DOI: 10.1080/00031305.2014.917055
Exponential smoothing: https://www.statsmodels.org/stable/examples/notebooks/generated/exponential_smoothing.html
Time series seasonality: https://machinelearningmastery.com/time-series-seasonality-with-python/
Holt-Winters exponential smoothing: https://towardsdatascience.com/holt-winters-exponential-smoothing-d703072c0572
Time series analysis in Python: https://www.kaggle.com/kashnitsky/topic-9-part-1-time-series-analysis-in-python
Time Series Decomposition (Level, Trend, Seasonality): https://machinelearningmastery.com/decompose-time-series-data-trend-seasonality/
Model evaluation: https://towardsdatascience.com/perform-regression-diagnostics-and-tackle-uncertainties-of-linear-models-1372a03b1f56
GLSAR regression: https://www.statsmodels.org/stable/examples/notebooks/generated/gls.html
Stephanie Glen, "Variance Inflation Factor", StatisticsHowTo.com. https://www.statisticshowto.com/variance-inflation-factor/
AndreaYoss/Capstone (github.com): "Modeling Part 1. Univariate Time Series Analysis.ipynb" and "Modeling Part 2. Multivariate Time Series Analysis.ipynb"